{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Datawhale 智慧海洋建设-Task3 特征工程"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "此部分为智慧海洋建设竞赛的特征工程模块，通过特征工程，可以最大限度地从原始数据中提取特征以供算法和模型使用。通俗而言，就是通过X，创造新的X'以获得更好的训练、预测效果。\n",
    "\n",
    "“数据和特征决定了机器学习的上限，而模型和算法只是逼近这个上限而已”——机器学习界；\n",
    "\n",
    "类似的，吴恩达曾说过：“特征工程不仅操作困难、耗时，而且需要专业领域知识。应用机器学习基本上就是特征工程。”\n",
    "\n",
    "\n",
    "赛题：智慧海洋建设\n",
    "\n",
    "特征工程的目的:\n",
    "\n",
    "- 特征工程是一个包含内容很多的主题，也被认为是成功应用机器学习的一个很重要的环节。如何充分利用数据进行预测建模就是特征工程要解决的问题！ “实际上，所有机器学习算法的成功取决于如何呈现数据。” “特征工程是一个看起来不值得在任何论文或者书籍中被探讨的一个主题。但是他却对机器学习的成功与否起着至关重要的作用。机器学习算法很多都是由于建立一个学习器能够理解的工程化特征而获得成功的。”——ScottLocklin，in “Neglected machine learning ideas”\n",
    "\n",
    "\n",
    "- 数据中的特征对预测的模型和获得的结果有着直接的影响。可以这样认为，特征选择和准备越好，获得的结果也就越好。这是正确的，但也存在误导。预测的结果其实取决于许多相关的属性：比如说能获得的数据、准备好的特征以及模型的选择。\n",
    "\n",
    "\n",
    "- 上分！:) 毫不夸张的说在基本的数据挖掘类比赛中，特征工程就是你和topline的距离。\n",
    "\n",
    "项目地址：https://github.com/datawhalechina/team-learning-data-mining/tree/master/wisdomOcean\n",
    "\n",
    "\n",
    "比赛地址：https://tianchi.aliyun.com/competition/entrance/231768/introduction?spm=5176.12281957.1004.8.4ac63eafE1rwsY"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 学习目标"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. 学习特征工程的基本概念\n",
    "\n",
    "\n",
    "2. 学习topline代码的特征工程构造方法，实现构建有意义的特征工程\n",
    "\n",
    "\n",
    "3. 完成相应学习打卡任务"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 内容介绍"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "0. 特征工程概述\n",
    "\n",
    "1. 赛题特征工程\n",
    "    - 业务特征，根据先验知识进行专业性的特征构建\n",
    "2. 分箱特征\n",
    "    - v、x、y的分箱特征\n",
    "    - x、y分箱后并构造区域\n",
    "3. DataFramte特征\n",
    "    - count计数值\n",
    "    - shift偏移量\n",
    "    - 统计特征\n",
    "4. Embedding特征\n",
    "    - Word2vec构造词向量\n",
    "    - NMF提取文本的主题分布\n",
    "5. 总结"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 特征工程概述"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "特征工程大体可分为3部分，特征构建、特征提取和特征选择。\n",
    "\n",
    "- 特征构建\n",
    "\n",
    "“从数学的角度讲，特征工程就是将原始数据空间变换到新的特征空间，或者说是换一种数据的表达方式，在新的特征空间中，模型能够更好地学习数据中的规律。因此，特征抽取就是对原始数据进行变换的过程。大多数模型和算法都要求输入是维度相同的实向量，因此特征工程首先需要将原始数据转化为实向量。”\n",
    "其主要包含内容有：\n",
    "\n",
    "    + 探索性数据分析\n",
    "    + 数值特征\n",
    "    + 类别特征\n",
    "    + 时间特征\n",
    "    + 文本特征\n",
    "\n",
    "- 特征提取和特征选择\n",
    "\n",
    "特征提取和特征选择概念上来说很像，其实特征提取指的是通过特征转换得到一组具有明显物理或统计意义的特征。而特征选择就是在特征集里直接挑出具有明显物理或统计意义的特征。\n",
    "\n",
    "与特征提取是从原始数据中构造新的特征不同，特征选择是从这些特征集合中选出一个子集。特征选择对于机器学习应用来说非常重要。特征选择也称为属性选择或变量选择，是指为了构建模型而选择相关特征子集的过程。特征选择的目的有如下三个。\n",
    "\n",
    "    + 简化模型，使模型更易于研究人员和用户理解。可解释性不仅让我们对模型效果的稳定性有更多的把握，而且也能为业务运营等工作提供指引和决策支持。\n",
    "\n",
    "    +  改善性能。特征选择的另一个作用是节省存储和计算开销。\n",
    "\n",
    "    +  改善通用性、降低过拟合风险。特征的增多会大大增加模型的搜索空间，大多数模型所需要的训练样本数目随着特征数量的增加而显著增加，特征的增加虽然能更好地拟合训练数据，但也可能增加方差。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "————————————————————————————————————————————————————————————————————"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "注：本ipynb着重学习topline代码的特征工程构造方法，效果需要模型方面进行预测打分"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "————————————————————————————————————————————————————————————————————"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "导入所需库和数据\n",
    "\n",
    "补充：\n",
    "下述库中的geopandas安装可能会遇到问题，可通过如下博客解决：\n",
    "\n",
    "https://qianni1997.github.io/2019/07/26/geopandas-install/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:40:44.860521Z",
     "start_time": "2021-04-06T09:40:29.681465Z"
    }
   },
   "outputs": [],
   "source": [
    "import gc\n",
    "import multiprocessing as mp\n",
    "import os\n",
    "import pickle\n",
    "import time\n",
    "import warnings\n",
    "from collections import Counter\n",
    "from copy import deepcopy\n",
    "from datetime import datetime\n",
    "from functools import partial\n",
    "from glob import glob\n",
    "\n",
    "import geopandas as gpd\n",
    "import lightgbm as lgb\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "from gensim.models import FastText, Word2Vec\n",
    "from gensim.models.doc2vec import Doc2Vec, TaggedDocument\n",
    "from pyproj import Proj\n",
    "from scipy import sparse\n",
    "from scipy.sparse import csr_matrix\n",
    "from sklearn import metrics\n",
    "from sklearn.cluster import DBSCAN\n",
    "from sklearn.decomposition import NMF, TruncatedSVD\n",
    "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n",
    "from sklearn.metrics import f1_score, precision_recall_fscore_support\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "from tqdm import tqdm\n",
    "\n",
    "os.environ['PYTHONHASHSEED'] = '0'\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:40:45.155446Z",
     "start_time": "2021-04-06T09:40:44.861521Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "  0%|                                                                                         | 0/7000 [00:00<?, ?it/s]\n"
     ]
    }
   ],
   "source": [
    "# 不直接对DataFrame做append操作，提升运行速度\n",
    "def get_data(file_path,max_lines = 2000):\n",
    "    paths = os.listdir(file_path)\n",
    "    tmp = []\n",
    "    for t in tqdm(range(len(paths))):\n",
    "        if len(tmp) > max_lines:break\n",
    "            \n",
    "        p = paths[t]\n",
    "        with open('{}/{}'.format(file_path, p), encoding='utf-8') as f:\n",
    "            next(f)\n",
    "            for line in f.readlines():\n",
    "                tmp.append(line.strip().split(','))\n",
    "                if len(tmp) > max_lines:break\n",
    "                    \n",
    "    tmp_df = pd.DataFrame(tmp)\n",
    "    tmp_df.columns = ['渔船ID', 'x', 'y', '速度', '方向', 'time', 'type']\n",
    "    return tmp_df\n",
    "\n",
    "TRAIN_PATH = \"../input/hy_round1_train_20200102/\"\n",
    "# 采样数据行数\n",
    "max_lines = 2000\n",
    "df = get_data(TRAIN_PATH,max_lines=max_lines)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:40:45.217623Z",
     "start_time": "2021-04-06T09:40:45.157392Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>x</th>\n",
       "      <th>y</th>\n",
       "      <th>v</th>\n",
       "      <th>dir</th>\n",
       "      <th>time</th>\n",
       "      <th>label</th>\n",
       "      <th>date</th>\n",
       "      <th>hour</th>\n",
       "      <th>month</th>\n",
       "      <th>weekday</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>6.152038e+06</td>\n",
       "      <td>5.124873e+06</td>\n",
       "      <td>2.59</td>\n",
       "      <td>102</td>\n",
       "      <td>1900-11-10 11:58:19</td>\n",
       "      <td>0</td>\n",
       "      <td>1900-11-10</td>\n",
       "      <td>11</td>\n",
       "      <td>11</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>6.151230e+06</td>\n",
       "      <td>5.125218e+06</td>\n",
       "      <td>2.70</td>\n",
       "      <td>113</td>\n",
       "      <td>1900-11-10 11:48:19</td>\n",
       "      <td>0</td>\n",
       "      <td>1900-11-10</td>\n",
       "      <td>11</td>\n",
       "      <td>11</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>6.150421e+06</td>\n",
       "      <td>5.125563e+06</td>\n",
       "      <td>2.70</td>\n",
       "      <td>116</td>\n",
       "      <td>1900-11-10 11:38:19</td>\n",
       "      <td>0</td>\n",
       "      <td>1900-11-10</td>\n",
       "      <td>11</td>\n",
       "      <td>11</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>6.149612e+06</td>\n",
       "      <td>5.125907e+06</td>\n",
       "      <td>3.29</td>\n",
       "      <td>95</td>\n",
       "      <td>1900-11-10 11:28:19</td>\n",
       "      <td>0</td>\n",
       "      <td>1900-11-10</td>\n",
       "      <td>11</td>\n",
       "      <td>11</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>6.148803e+06</td>\n",
       "      <td>5.126252e+06</td>\n",
       "      <td>3.18</td>\n",
       "      <td>108</td>\n",
       "      <td>1900-11-10 11:18:19</td>\n",
       "      <td>0</td>\n",
       "      <td>1900-11-10</td>\n",
       "      <td>11</td>\n",
       "      <td>11</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  id             x             y     v  dir                time  label  \\\n",
       "0  0  6.152038e+06  5.124873e+06  2.59  102 1900-11-10 11:58:19      0   \n",
       "1  0  6.151230e+06  5.125218e+06  2.70  113 1900-11-10 11:48:19      0   \n",
       "2  0  6.150421e+06  5.125563e+06  2.70  116 1900-11-10 11:38:19      0   \n",
       "3  0  6.149612e+06  5.125907e+06  3.29   95 1900-11-10 11:28:19      0   \n",
       "4  0  6.148803e+06  5.126252e+06  3.18  108 1900-11-10 11:18:19      0   \n",
       "\n",
       "         date  hour  month  weekday  \n",
       "0  1900-11-10    11     11        5  \n",
       "1  1900-11-10    11     11        5  \n",
       "2  1900-11-10    11     11        5  \n",
       "3  1900-11-10    11     11        5  \n",
       "4  1900-11-10    11     11        5  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 基本预处理\n",
    "label_dict1 = {'拖网': 0, '围网': 1, '刺网': 2}\n",
    "label_dict2 = {0: '拖网', 1: '围网', 2: '刺网'}\n",
    "name_dict = {'渔船ID': 'id', '速度': 'v', '方向': 'dir', 'type': 'label'}\n",
    "\n",
    "df.rename(columns = name_dict, inplace = True)\n",
    "df['label'] = df['label'].map(label_dict1)\n",
    "cols = ['x','y','v']\n",
    "for col in cols:\n",
    "    df[col] = df[col].astype('float')\n",
    "df['dir'] = df['dir'].astype('int')\n",
    "df['time'] = pd.to_datetime(df['time'], format='%m%d %H:%M:%S')\n",
    "df['date'] = df['time'].dt.date\n",
    "df['hour'] = df['time'].dt.hour\n",
    "df['month'] = df['time'].dt.month\n",
    "df['weekday'] = df['time'].dt.weekday\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据说明：\n",
    "\n",
    "    - id：渔船ID，整数\n",
    "    - x：记录位置横坐标，浮点数\n",
    "    - y：记录位置纵坐标，浮点数\n",
    "    - v：记录速度，浮点数\n",
    "    - dir：记录航向，整数\n",
    "    - time：时间，文本\n",
    "    - label：需要预测的标签，整数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 赛题特征工程"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 构造各点的(x、y)坐标与特定点(6165599,5202660)的距离"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:40:51.254522Z",
     "start_time": "2021-04-06T09:40:51.223636Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    78959.780945\n",
       "1    78763.845006\n",
       "2    78577.185266\n",
       "3    78399.867568\n",
       "4    78231.955018\n",
       "Name: base_dis_diff, dtype: float64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['x_dis_diff'] = (df['x'] - 6165599).abs()\n",
    "df['y_dis_diff'] = (df['y'] - 5202660).abs()\n",
    "df['base_dis_diff'] = ((df['x_dis_diff']**2)+(df['y_dis_diff']**2))**0.5    \n",
    "del df['x_dis_diff'],df['y_dis_diff'] \n",
    "df['base_dis_diff'].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 对时间，小时进行白天、黑天进行划分，5-20为白天1，其余为黑天0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:40:52.721776Z",
     "start_time": "2021-04-06T09:40:52.696829Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1\n",
       "1    1\n",
       "2    1\n",
       "3    1\n",
       "4    1\n",
       "Name: day_nig, dtype: int64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['day_nig'] = 0\n",
    "df.loc[(df['hour'] > 5) & (df['hour'] < 20),'day_nig'] = 1\n",
    "df['day_nig'].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 根据月份划分季度"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:40:54.053897Z",
     "start_time": "2021-04-06T09:40:54.030942Z"
    }
   },
   "outputs": [],
   "source": [
    "# 季度\n",
    "df['quarter'] = 0\n",
    "df.loc[(df['month'].isin([1, 2, 3])), 'quarter'] = 1\n",
    "df.loc[(df['month'].isin([4, 5, 6, ])), 'quarter'] = 2\n",
    "df.loc[(df['month'].isin([7, 8, 9])), 'quarter'] = 3\n",
    "df.loc[(df['month'].isin([10, 11, 12])), 'quarter'] = 4"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 动态速度，速度变化，角度变化，xy相似性等特征"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:40:55.098791Z",
     "start_time": "2021-04-06T09:40:55.062887Z"
    }
   },
   "outputs": [],
   "source": [
    "temp = df.copy()\n",
    "temp.rename(columns={'id':'ship','dir':'d'},inplace=True)\n",
    "\n",
    "# 给速度一个等级\n",
    "def v_cut(v):\n",
    "    if v < 0.1:\n",
    "        return 0\n",
    "    elif v < 0.5:\n",
    "        return 1\n",
    "    elif v < 1:\n",
    "        return 2\n",
    "    elif v < 2.5:\n",
    "        return 3\n",
    "    elif v < 5:\n",
    "        return 4\n",
    "    elif v < 10:\n",
    "        return 5\n",
    "    elif v < 20:\n",
    "        return 5\n",
    "    else:\n",
    "        return 6\n",
    "# 统计每个ship的对应速度等级的个数\n",
    "def get_v_fea(df):\n",
    "\n",
    "    df['v_cut'] = df['v'].apply(lambda x: v_cut(x))\n",
    "    tmp = df.groupby(['ship', 'v_cut'], as_index=False)['v_cut'].agg({'v_cut_count': 'count'})\n",
    "    # 通过pivot构建透视表\n",
    "    tmp = tmp.pivot(index='ship', columns='v_cut', values='v_cut_count')\n",
    "\n",
    "    new_col_nm = ['v_cut_' + str(col) for col in tmp.columns.tolist()]\n",
    "    tmp.columns = new_col_nm\n",
    "    tmp = tmp.reset_index()  # 把index恢复成data\n",
    "\n",
    "    return tmp\n",
    "\n",
    "c1 = get_v_fea(temp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:40:56.796042Z",
     "start_time": "2021-04-06T09:40:56.769114Z"
    }
   },
   "outputs": [],
   "source": [
    "# 方位进行16均分\n",
    "def add_direction(df):\n",
    "    df['d16'] = df['d'].apply(lambda x: int((x / 22.5) + 0.5) % 16 if not np.isnan(x) else np.nan)\n",
    "    return df\n",
    "def get_d_cut_count_fea(df):\n",
    "    df = add_direction(df)\n",
    "    tmp = df.groupby(['ship', 'd16'], as_index=False)['d16'].agg({'d16_count': 'count'})\n",
    "    tmp = tmp.pivot(index='ship', columns='d16', values='d16_count')\n",
    "    new_col_nm = ['d16_' + str(col) for col in tmp.columns.tolist()]\n",
    "    tmp.columns = new_col_nm\n",
    "    tmp = tmp.reset_index()\n",
    "    return tmp\n",
    "\n",
    "c2 = get_d_cut_count_fea(temp)"
   ]
  },
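  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of the rounding used in `add_direction` above, the snippet below applies the same formula to a few hand-picked headings. `to_sector16` is a hypothetical helper introduced only for this illustration and is not part of the original solution.\n",
    "\n",
    "```python\n",
    "def to_sector16(deg):\n",
    "    # Each sector spans 22.5 degrees. Adding 0.5 before truncating rounds to the\n",
    "    # nearest sector, and % 16 wraps headings close to 360 back to sector 0.\n",
    "    return int(deg / 22.5 + 0.5) % 16\n",
    "\n",
    "for deg in [0, 11, 12, 90, 180, 349, 359]:\n",
    "    print(deg, to_sector16(deg))\n",
    "```\n",
    "\n",
    "Headings just below 360 degrees map back to sector 0, so sector 0 is centered on 0 degrees rather than starting at it."
   ]
  },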
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:40:57.574641Z",
     "start_time": "2021-04-06T09:40:57.539739Z"
    }
   },
   "outputs": [],
   "source": [
    "def get_v0_fea(df):\n",
    "    # 统计速度为0的个数，以及速度不为0的统计量\n",
    "    df_zero_count = df.query(\"v==0\")[['ship', 'v']].groupby('ship', as_index=False)['v'].agg(\n",
    "        {'num_zero_v': 'count'})\n",
    "    df_not_zero_agg = df.query(\"v!=0\")[['ship', 'v']].groupby('ship', as_index=False)['v'].agg(\n",
    "        {'v_max_drop_0': 'max',\n",
    "         'v_min_drop_0': 'min',\n",
    "         'v_mean_drop_0': 'mean',\n",
    "         'v_std_drop_0': 'std',\n",
    "         'v_median_drop_0': 'median',\n",
    "         'v_skew_drop_0': 'skew'})\n",
    "    tmp = df_zero_count.merge(df_not_zero_agg, on='ship', how='left')\n",
    "\n",
    "    return tmp\n",
    "\n",
    "c3 = get_v0_fea(temp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:40:58.057987Z",
     "start_time": "2021-04-06T09:40:57.967114Z"
    }
   },
   "outputs": [],
   "source": [
    "def get_percentiles_fea(df_raw):\n",
    "    key = ['x', 'y', 'v', 'd']\n",
    "    temp = df_raw[['ship']].drop_duplicates('ship')\n",
    "    for i in range(len(key)):\n",
    "        # 加入x，v，d，y的中位数和各种位数\n",
    "        tmp_dscb = df_raw.groupby('ship')[key[i]].describe(\n",
    "            percentiles=[0.05] + [ii / 1000 for ii in range(125, 1000, 125)] + [0.95])\n",
    "        raw_col_nm = tmp_dscb.columns.tolist()\n",
    "        new_col_nm = [key[i] + '_' + col for col in raw_col_nm]\n",
    "        tmp_dscb.columns = new_col_nm\n",
    "        tmp_dscb = tmp_dscb.reset_index()\n",
    "        # 删掉多余的统计特征\n",
    "        tmp_dscb = tmp_dscb.drop([f'{key[i]}_count', f'{key[i]}_mean', f'{key[i]}_std',\n",
    "                                  f'{key[i]}_min', f'{key[i]}_max'], axis=1)\n",
    "\n",
    "        temp = temp.merge(tmp_dscb, on='ship', how='left')\n",
    "    return temp\n",
    "\n",
    "c4 = get_percentiles_fea(temp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:40:58.605497Z",
     "start_time": "2021-04-06T09:40:58.425813Z"
    }
   },
   "outputs": [],
   "source": [
    "def get_d_change_rate_fea(df):\n",
    "    import math\n",
    "    import time\n",
    "    temp = df.copy()\n",
    "    # 以ship、time为主键进行排序\n",
    "    temp.sort_values(['ship', 'time'], ascending=True, inplace=True)\n",
    "    # 通过shift求相邻差异值，注意学习.shift(-1,1)的含义\n",
    "    temp['timenext'] = temp.groupby('ship')['time'].shift(-1)\n",
    "    temp['ynext'] = temp.groupby('ship')['y'].shift(-1)\n",
    "    temp['xnext'] = temp.groupby('ship')['x'].shift(-1)\n",
    "    # 将shift得到的差异量进行填充，为什么会有空值NaN？\n",
    "    # 因为shift的起始位置是没法比较的，故用空值来代替\n",
    "    temp['ynext'] = temp['ynext'].fillna(method='ffill')\n",
    "    temp['xnext'] = temp['xnext'].fillna(method='ffill')\n",
    "    # 这里笔者的理解是ynext/xnext，而不需要减去y和x，因为ynext和xnext本身就是偏移量了\n",
    "    temp['angle_next'] = (temp['ynext'] - temp['y']) / (temp['xnext'] - temp['x'])\n",
    "    temp['angle_next'] = np.arctan(temp['angle_next']) / math.pi * 180\n",
    "    temp['angle_next_next'] = temp['angle_next'].shift(-1)\n",
    "    temp['timediff'] = np.abs(temp['timenext'] - temp['time'])\n",
    "    temp['timediff'] = temp['timediff'].fillna(method='ffill')\n",
    "    temp['hc_xy'] = abs(temp['angle_next_next'] - temp['angle_next'])\n",
    "    # 对于hc_xy这列的值>180度的，进行修改成360度求差，仅考虑与水平线的角度\n",
    "    temp.loc[temp['hc_xy'] > 180, 'hc_xy'] = (360 - temp.loc[temp['hc_xy'] > 180, 'hc_xy'])\n",
    "    temp['hc_xy_s'] = temp.apply(lambda x: x['hc_xy'] / x['timediff'].total_seconds(), axis=1)\n",
    "\n",
    "    temp['d_next'] = temp.groupby('ship')['d'].shift(-1)\n",
    "    temp['hc_d'] = abs(temp['d_next'] - temp['d'])\n",
    "    temp.loc[temp['hc_d'] > 180, 'hc_d'] = 360 - temp.loc[temp['hc_d'] > 180, 'hc_d']\n",
    "    temp['hc_d_s'] = temp.apply(lambda x: x['hc_d'] / x['timediff'].total_seconds(), axis=1)\n",
    "\n",
    "    temp1 = temp[['ship', 'hc_xy_s', 'hc_d_s']]\n",
    "    xy_d_rate = temp1.groupby('ship')['hc_xy_s'].agg({'hc_xy_s_max': 'max',\n",
    "                                                      })\n",
    "    xy_d_rate = xy_d_rate.reset_index()\n",
    "    d_d_rate = temp1.groupby('ship')['hc_d_s'].agg({'hc_d_s_max': 'max',\n",
    "                                                    })\n",
    "    d_d_rate = d_d_rate.reset_index()\n",
    "\n",
    "    tmp = xy_d_rate.merge(d_d_rate, on='ship', how='left')\n",
    "    return tmp\n",
    "\n",
    "c5 = get_d_change_rate_fea(temp)"
   ]
  },
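  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the grouped `shift(-1)` in `get_d_change_rate_fea` easier to follow, here is a small sketch on made-up data. The `toy` frame and its values are assumptions chosen for illustration only; they do not come from the competition data.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy trajectory: two ships, rows already ordered by time within each ship.\n",
    "toy = pd.DataFrame({\n",
    "    'ship': [0, 0, 0, 1, 1],\n",
    "    'x': [10.0, 12.0, 15.0, 100.0, 101.0],\n",
    "})\n",
    "\n",
    "# shift(-1) inside each group pulls the NEXT row's value up by one position,\n",
    "# so the last row of every ship has no successor and becomes NaN.\n",
    "toy['xnext'] = toy.groupby('ship')['x'].shift(-1)\n",
    "\n",
    "# Displacement to the next point; NaN marks the end of each trajectory.\n",
    "toy['dx'] = toy['xnext'] - toy['x']\n",
    "print(toy)\n",
    "```\n",
    "\n",
    "Because the shift is grouped by `ship`, the last point of ship 0 is never paired with the first point of ship 1."
   ]
  },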
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:40:59.036757Z",
     "start_time": "2021-04-06T09:40:58.989886Z"
    }
   },
   "outputs": [],
   "source": [
    "f1 = temp.merge(c1,on='ship',how='left')\n",
    "f1 = f1.merge(c2,on='ship',how='left')\n",
    "f1 = f1.merge(c3,on='ship',how='left')\n",
    "f1 = f1.merge(c4,on='ship',how='left')\n",
    "f1 = f1.merge(c5,on='ship',how='left')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 分箱特征"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## v、x、y的分箱特征"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:00.267094Z",
     "start_time": "2021-04-06T09:41:00.126455Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>v_bin</th>\n",
       "      <th>x_bin1</th>\n",
       "      <th>x_bin2</th>\n",
       "      <th>x_bin1_count</th>\n",
       "      <th>x_bin2_count</th>\n",
       "      <th>x_bin1_id_nunique</th>\n",
       "      <th>x_bin2_id_nunique</th>\n",
       "      <th>y_bin1</th>\n",
       "      <th>y_bin2</th>\n",
       "      <th>y_bin1_count</th>\n",
       "      <th>...</th>\n",
       "      <th>y_bin1_id_nunique</th>\n",
       "      <th>y_bin2_id_nunique</th>\n",
       "      <th>x_y_bin1</th>\n",
       "      <th>x_bin1_y_bin1_count</th>\n",
       "      <th>x_y_bin2</th>\n",
       "      <th>x_bin2_y_bin2_count</th>\n",
       "      <th>x_y_max</th>\n",
       "      <th>y_x_max</th>\n",
       "      <th>x_y_min</th>\n",
       "      <th>y_x_min</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0</td>\n",
       "      <td>615.0</td>\n",
       "      <td>116</td>\n",
       "      <td>8</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>512.0</td>\n",
       "      <td>2</td>\n",
       "      <td>...</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>-115954.675157</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>49790.106760</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.0</td>\n",
       "      <td>1</td>\n",
       "      <td>615.0</td>\n",
       "      <td>2</td>\n",
       "      <td>8</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>512.0</td>\n",
       "      <td>2</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>53070.048324</td>\n",
       "      <td>808.872353</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>2</td>\n",
       "      <td>615.0</td>\n",
       "      <td>2</td>\n",
       "      <td>8</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>512.0</td>\n",
       "      <td>2</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>-808.872353</td>\n",
       "      <td>54707.512092</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.0</td>\n",
       "      <td>3</td>\n",
       "      <td>614.0</td>\n",
       "      <td>2</td>\n",
       "      <td>77</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>512.0</td>\n",
       "      <td>2</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>8</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>52951.293120</td>\n",
       "      <td>808.787673</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2.0</td>\n",
       "      <td>4</td>\n",
       "      <td>614.0</td>\n",
       "      <td>2</td>\n",
       "      <td>77</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>512.0</td>\n",
       "      <td>2</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>8</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>-808.787673</td>\n",
       "      <td>55461.653028</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 21 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   v_bin x_bin1  x_bin2  x_bin1_count  x_bin2_count  x_bin1_id_nunique  \\\n",
       "0    0.0      0   615.0           116             8                  2   \n",
       "1    0.0      1   615.0             2             8                  2   \n",
       "2    0.0      2   615.0             2             8                  2   \n",
       "3    1.0      3   614.0             2            77                  2   \n",
       "4    2.0      4   614.0             2            77                  2   \n",
       "\n",
       "   x_bin2_id_nunique y_bin1  y_bin2  y_bin1_count  ...  y_bin1_id_nunique  \\\n",
       "0                  2      0   512.0             2  ...                  2   \n",
       "1                  2      1   512.0             2  ...                  1   \n",
       "2                  2      1   512.0             2  ...                  1   \n",
       "3                  2      2   512.0             2  ...                  1   \n",
       "4                  2      2   512.0             2  ...                  1   \n",
       "\n",
       "   y_bin2_id_nunique  x_y_bin1  x_bin1_y_bin1_count  x_y_bin2  \\\n",
       "0                  1         0                    1         0   \n",
       "1                  1         1                    1         0   \n",
       "2                  1         2                    1         0   \n",
       "3                  1         3                    1         1   \n",
       "4                  1         4                    1         1   \n",
       "\n",
       "   x_bin2_y_bin2_count        x_y_max     y_x_max       x_y_min       y_x_min  \n",
       "0                    3 -115954.675157    0.000000      0.000000  49790.106760  \n",
       "1                    3       0.000000    0.000000  53070.048324    808.872353  \n",
       "2                    3       0.000000 -808.872353  54707.512092      0.000000  \n",
       "3                    8       0.000000    0.000000  52951.293120    808.787673  \n",
       "4                    8       0.000000 -808.787673  55461.653028      0.000000  \n",
       "\n",
       "[5 rows x 21 columns]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pre_cols = df.columns\n",
    "\n",
    "df['v_bin'] = pd.qcut(df['v'], 200, duplicates='drop') # 速度进行 200分位数分箱\n",
    "df['v_bin'] = df['v_bin'].map(dict(zip(df['v_bin'].unique(), range(df['v_bin'].nunique())))) # 分箱后映射编码\n",
    "for f in ['x', 'y']:\n",
    "    df[f + '_bin1'] = pd.qcut(df[f], 1000, duplicates='drop') # x,y位置分箱1000\n",
    "    df[f + '_bin1'] = df[f + '_bin1'].map(dict(zip(df[f + '_bin1'].unique(), range(df[f + '_bin1'].nunique()))))#编码\n",
    "    df[f + '_bin2'] = df[f] // 10000 # 取整操作\n",
    "    df[f + '_bin1_count'] = df[f + '_bin1'].map(df[f + '_bin1'].value_counts()) #x,y不同分箱的数量映射\n",
    "    df[f + '_bin2_count'] = df[f + '_bin2'].map(df[f + '_bin2'].value_counts()) #数量映射\n",
    "    df[f + '_bin1_id_nunique'] = df.groupby(f + '_bin1')['id'].transform('nunique')#基于分箱1 id数量映射\n",
    "    df[f + '_bin2_id_nunique'] = df.groupby(f + '_bin2')['id'].transform('nunique')#基于分箱2 id数量映射\n",
    "for i in [1, 2]:\n",
    "    # 特征交叉x_bin1（2）,y_bin1（2） 形成类别 统计每类数量映射到列  \n",
    "    df['x_y_bin{}'.format(i)] = df['x_bin{}'.format(i)].astype('str') + '_' + df['y_bin{}'.format(i)].astype('str')\n",
    "    df['x_y_bin{}'.format(i)] = df['x_y_bin{}'.format(i)].map(\n",
    "        dict(zip(df['x_y_bin{}'.format(i)].unique(), range(df['x_y_bin{}'.format(i)].nunique())))\n",
    "    )\n",
    "    df['x_bin{}_y_bin{}_count'.format(i, i)] = df['x_y_bin{}'.format(i)].map(df['x_y_bin{}'.format(i)].value_counts())\n",
    "for stat in ['max', 'min']:\n",
    "    # 统计x_bin1 y_bin1的最大最小值\n",
    "    df['x_y_{}'.format(stat)] = df['y'] - df.groupby('x_bin1')['y'].transform(stat)\n",
    "    df['y_x_{}'.format(stat)] = df['x'] - df.groupby('y_bin1')['x'].transform(stat)\n",
    "\n",
    "new_cols = [i for i in df.columns if i not in pre_cols]\n",
    "df[new_cols].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  将x、y进行分箱并构造区域"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:01.197017Z",
     "start_time": "2021-04-06T09:41:01.181086Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "def traj_to_bin(traj=None, x_min=12031967.16239096, x_max=14226964.881853,\n",
    "                y_min=1623579.449434373, y_max=4689471.1780792,\n",
    "                row_bins=4380, col_bins=3136):\n",
    "\n",
    "    # Establish bins on x direction and y direction\n",
    "    x_bins = np.linspace(x_min, x_max, endpoint=True, num=col_bins + 1)\n",
    "    y_bins = np.linspace(y_min, y_max, endpoint=True, num=row_bins + 1)\n",
    "\n",
    "    # Determine each x coordinate belong to which bin\n",
    "    traj.sort_values(by='x', inplace=True)\n",
    "    x_res = np.zeros((len(traj), ))\n",
    "    j = 0\n",
    "    for i in range(1, col_bins + 1):\n",
    "        low, high = x_bins[i-1], x_bins[i]\n",
    "        while( j < len(traj)):\n",
    "            # low - 0.001 for numeric stable.\n",
    "            if (traj[\"x\"].iloc[j] <= high) & (traj[\"x\"].iloc[j] > low - 0.001):\n",
    "                x_res[j] = i\n",
    "                j += 1\n",
    "            else:\n",
    "                break\n",
    "    traj[\"x_grid\"] = x_res\n",
    "    traj[\"x_grid\"] = traj[\"x_grid\"].astype(int)\n",
    "    traj[\"x_grid\"] = traj[\"x_grid\"].apply(str)\n",
    "\n",
    "    # Determine each y coordinate belong to which bin\n",
    "    traj.sort_values(by='y', inplace=True)\n",
    "    y_res = np.zeros((len(traj), ))\n",
    "    j = 0\n",
    "    for i in range(1, row_bins + 1):\n",
    "        low, high = y_bins[i-1], y_bins[i]\n",
    "        while( j < len(traj)):\n",
    "            # low - 0.001 for numeric stable.\n",
    "            if (traj[\"y\"].iloc[j] <= high) & (traj[\"y\"].iloc[j] > low - 0.001):\n",
    "                y_res[j] = i\n",
    "                j += 1\n",
    "            else:\n",
    "                break\n",
    "    traj[\"y_grid\"] = y_res\n",
    "    traj[\"y_grid\"] = traj[\"y_grid\"].astype(int)\n",
    "    traj[\"y_grid\"] = traj[\"y_grid\"].apply(str)\n",
    "\n",
    "    # Determine which bin each coordinate belongs to.\n",
    "    traj[\"no_bin\"] = [i + \"_\" + j for i, j in zip(\n",
    "        traj[\"x_grid\"].values.tolist(), traj[\"y_grid\"].values.tolist())]\n",
    "    traj.sort_values(by='time', inplace=True)\n",
    "    return traj\n",
    "\n",
    "bin_size = 800\n",
    "col_bins = int((14226964.881853 - 12031967.16239096) / bin_size)\n",
    "row_bins = int((4689471.1780792 - 1623579.449434373) / bin_size)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:01.968441Z",
     "start_time": "2021-04-06T09:41:01.791913Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>x_grid</th>\n",
       "      <th>y_grid</th>\n",
       "      <th>no_bin</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1606</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0_0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1605</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0_0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1604</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0_0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1603</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0_0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1602</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0_0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1988</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0_0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1987</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0_0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1986</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0_0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1985</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0_0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1984</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0_0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2001 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "     x_grid y_grid no_bin\n",
       "1606      0      0    0_0\n",
       "1605      0      0    0_0\n",
       "1604      0      0    0_0\n",
       "1603      0      0    0_0\n",
       "1602      0      0    0_0\n",
       "...     ...    ...    ...\n",
       "1988      0      0    0_0\n",
       "1987      0      0    0_0\n",
       "1986      0      0    0_0\n",
       "1985      0      0    0_0\n",
       "1984      0      0    0_0\n",
       "\n",
       "[2001 rows x 3 columns]"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pre_cols = df.columns\n",
    "# 特征x_grid,y_grid,no_bin\n",
    "df = traj_to_bin(df)\n",
    "\n",
    "new_cols = [i for i in df.columns if i not in pre_cols]\n",
    "df[new_cols]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# DataFrame特征"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## count计数值"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:03.199290Z",
     "start_time": "2021-04-06T09:41:03.181338Z"
    }
   },
   "outputs": [],
   "source": [
    "def find_save_visit_count_table(traj_data_df=None, bin_to_coord_df=None):\n",
    "    \"\"\"Find and save the visit frequency of each bin.\"\"\"\n",
    "    visit_count_df = traj_data_df.groupby([\"no_bin\"]).count().reset_index()\n",
    "    visit_count_df = visit_count_df[[\"no_bin\", \"x\"]]\n",
    "    visit_count_df.rename({\"x\":\"visit_count\"}, axis=1, inplace=True)\n",
    "    return visit_count_df\n",
    "\n",
    "def find_save_unique_visit_count_table(traj_data_df=None, bin_to_coord_df=None):\n",
    "    \"\"\"Find and save the unique boat visit count of each bin.\"\"\"\n",
    "    unique_boat_count_df = traj_data_df.groupby([\"no_bin\"])[\"id\"].nunique().reset_index()\n",
    "    unique_boat_count_df.rename({\"id\":\"visit_boat_count\"}, axis=1, inplace=True)\n",
    "\n",
    "    unique_boat_count_df_save = pd.merge(bin_to_coord_df, unique_boat_count_df,\n",
    "                                         on=\"no_bin\", how=\"left\")\n",
    "    return unique_boat_count_df\n",
    "\n",
    "traj_df = df[[\"id\",\"x\", \"y\",'time',\"no_bin\"]]\n",
    "bin_to_coord_df = traj_df.groupby([\"no_bin\"]).median().reset_index()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:03.714709Z",
     "start_time": "2021-04-06T09:41:03.668832Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>visit_count</th>\n",
       "      <th>visit_boat_count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2001</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2001</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2001</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2001</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2001</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   visit_count  visit_boat_count\n",
       "0         2001                 6\n",
       "1         2001                 6\n",
       "2         2001                 6\n",
       "3         2001                 6\n",
       "4         2001                 6"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pre_cols = df.columns\n",
    "\n",
    "# DataFrame tmp for finding POIs\n",
    "visit_count_df = find_save_visit_count_table(\n",
    "    traj_df, bin_to_coord_df)\n",
    "unique_boat_count_df = find_save_unique_visit_count_table(\n",
    "    traj_df, bin_to_coord_df)\n",
    "\n",
    "# # 特征'visit_count','visit_boat_count'\n",
    "df = df.merge(visit_count_df,on='no_bin',how='left')\n",
    "df = df.merge(unique_boat_count_df,on='no_bin',how='left')\n",
    "\n",
    "new_cols = [i for i in df.columns if i not in pre_cols]\n",
    "df[new_cols].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## shift偏移量特征"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:04.554883Z",
     "start_time": "2021-04-06T09:41:04.503988Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>x_prev_diff</th>\n",
       "      <th>x_next_diff</th>\n",
       "      <th>x_prev_next_diff</th>\n",
       "      <th>y_prev_diff</th>\n",
       "      <th>y_next_diff</th>\n",
       "      <th>y_prev_next_diff</th>\n",
       "      <th>dist_move_prev</th>\n",
       "      <th>dist_move_next</th>\n",
       "      <th>dist_move_prev_next</th>\n",
       "      <th>dist_move_prev_bin</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NaN</td>\n",
       "      <td>-911.903731</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>455.919062</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1019.524696</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>911.903731</td>\n",
       "      <td>-911.965576</td>\n",
       "      <td>-1823.869307</td>\n",
       "      <td>-455.919062</td>\n",
       "      <td>455.831205</td>\n",
       "      <td>911.750267</td>\n",
       "      <td>1019.524696</td>\n",
       "      <td>1019.540730</td>\n",
       "      <td>2039.065423</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>911.965576</td>\n",
       "      <td>-918.791508</td>\n",
       "      <td>-1830.757085</td>\n",
       "      <td>-455.831205</td>\n",
       "      <td>20.360332</td>\n",
       "      <td>476.191538</td>\n",
       "      <td>1019.540730</td>\n",
       "      <td>919.017072</td>\n",
       "      <td>1891.673831</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>918.791508</td>\n",
       "      <td>-597.354368</td>\n",
       "      <td>-1516.145877</td>\n",
       "      <td>-20.360332</td>\n",
       "      <td>993.131365</td>\n",
       "      <td>1013.491697</td>\n",
       "      <td>919.017072</td>\n",
       "      <td>1158.940097</td>\n",
       "      <td>1823.695078</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>597.354368</td>\n",
       "      <td>-910.468269</td>\n",
       "      <td>-1507.822637</td>\n",
       "      <td>-993.131365</td>\n",
       "      <td>564.435006</td>\n",
       "      <td>1557.566370</td>\n",
       "      <td>1158.940097</td>\n",
       "      <td>1071.232628</td>\n",
       "      <td>2167.842730</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   x_prev_diff  x_next_diff  x_prev_next_diff  y_prev_diff  y_next_diff  \\\n",
       "0          NaN  -911.903731               NaN          NaN   455.919062   \n",
       "1   911.903731  -911.965576      -1823.869307  -455.919062   455.831205   \n",
       "2   911.965576  -918.791508      -1830.757085  -455.831205    20.360332   \n",
       "3   918.791508  -597.354368      -1516.145877   -20.360332   993.131365   \n",
       "4   597.354368  -910.468269      -1507.822637  -993.131365   564.435006   \n",
       "\n",
       "   y_prev_next_diff  dist_move_prev  dist_move_next  dist_move_prev_next  \\\n",
       "0               NaN             NaN     1019.524696                  NaN   \n",
       "1        911.750267     1019.524696     1019.540730          2039.065423   \n",
       "2        476.191538     1019.540730      919.017072          1891.673831   \n",
       "3       1013.491697      919.017072     1158.940097          1823.695078   \n",
       "4       1557.566370     1158.940097     1071.232628          2167.842730   \n",
       "\n",
       "   dist_move_prev_bin  \n",
       "0                 NaN  \n",
       "1                 1.0  \n",
       "2                 1.0  \n",
       "3                 2.0  \n",
       "4                 3.0  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pre_cols = df.columns\n",
    "\n",
    "g = df.groupby('id')\n",
    "for f in ['x', 'y']:\n",
    "    #对x,y坐标进行时间平移 1 -1 2\n",
    "    df[f + '_prev_diff'] = df[f] - g[f].shift(1)\n",
    "    df[f + '_next_diff'] = df[f] - g[f].shift(-1)\n",
    "    df[f + '_prev_next_diff'] = g[f].shift(1) - g[f].shift(-1)\n",
    "    ## 三角形求解上时刻1距离  下时刻-1距离 2距离 \n",
    "df['dist_move_prev'] = np.sqrt(np.square(df['x_prev_diff']) + np.square(df['y_prev_diff']))\n",
    "df['dist_move_next'] = np.sqrt(np.square(df['x_next_diff']) + np.square(df['y_next_diff']))\n",
    "df['dist_move_prev_next'] = np.sqrt(np.square(df['x_prev_next_diff']) + np.square(df['y_prev_next_diff']))\n",
    "df['dist_move_prev_bin'] = pd.qcut(df['dist_move_prev'], 50, duplicates='drop')# 2时刻距离等频分箱50\n",
    "df['dist_move_prev_bin'] = df['dist_move_prev_bin'].map(\n",
    "    dict(zip(df['dist_move_prev_bin'].unique(), range(df['dist_move_prev_bin'].nunique())))\n",
    ") #上一时刻映射编码\n",
    "\n",
    "new_cols = [i for i in df.columns if i not in pre_cols]\n",
    "df[new_cols].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 统计特征"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 基本统计特征用法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "补充：\n",
    "\n",
    "分组统计特征agg的使用非常重要，在此进行代码示例，详细请参考：\n",
    "http://joyfulpandas.datawhale.club/Content/ch4.html\n",
    "\n",
    "- 请注意{}和[]的使用\n",
    "\n",
    "分组标准格式：\n",
    "\n",
    "df.groupby(分组依据)[数据来源].使用操作\n",
    "\n",
    "先分组，得到\n",
    "\n",
    "gb = df.groupby(['School', 'Grade'])\n",
    "\n",
    "- 【a】使用多个函数\n",
    "\n",
    "gb.agg(['具体方法（如内置函数）'])\n",
    "\n",
    "如gb.agg(['sum'])\n",
    "\n",
    "\n",
    "- 【b】对特定的列使用特定的聚合函数\n",
    "\n",
    "gb.agg({'指定列':'具体方法'})\n",
    "\n",
    "如gb.agg({'Height':['mean','max'], 'Weight':'count'})\n",
    "\n",
    "- 【c】使用自定义函数\n",
    "\n",
    "gb.agg(函数名或匿名函数)\n",
    "\n",
    "如gb.agg(lambda x: x.mean()-x.min())\n",
    "\n",
    "- 【d】聚合结果重命名\n",
    "\n",
    "gb.agg([\n",
    "    ('重命名的名字',具体方法（如内置函数、自定义函数）)\n",
    "])\n",
    "\n",
    "如gb.agg([('range', lambda x: x.max()-x.min()), ('my_sum', 'sum')])\n",
    "\n",
    "另外需要注意，使用对一个或者多个列使用单个聚合的时候，重命名需要加方括号，否则就不知道是新的名字还是手误输错的内置函数字符串：\n",
    "\n",
    "- 下述代码主要使用了\n",
    "\n",
    "一种是df.groupby('id').agg{'列名':'方法'}，另一种是df.groupby('id')['列名'].agg(字典)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:08.013040Z",
     "start_time": "2021-04-06T09:41:07.908757Z"
    }
   },
   "outputs": [],
   "source": [
    "pre_cols = df.columns\n",
    "\n",
    "def start(x):\n",
    "    try:\n",
    "        return x[0]\n",
    "    except:\n",
    "        return None\n",
    "\n",
    "def end(x):\n",
    "    try:\n",
    "        return x[-1]\n",
    "    except:\n",
    "        return None\n",
    "\n",
    "\n",
    "def mode(x):\n",
    "    try:\n",
    "        return pd.Series(x).value_counts().index[0]\n",
    "    except:\n",
    "        return None\n",
    "\n",
    "for f in ['dist_move_prev_bin', 'v_bin']:\n",
    "    # 上一时刻类别 速度类别映射处理\n",
    "    df[f + '_sen'] = df['id'].map(df.groupby('id')[f].agg(lambda x: ','.join(x.astype(str))))\n",
    "    \n",
    "    # 一系列基本统计量特征 每列执行相应的操作\n",
    "g = df.groupby('id').agg({\n",
    "    'id': ['count'], 'x_bin1': [mode], 'y_bin1': [mode], 'x_bin2': [mode], 'y_bin2': [mode], 'x_y_bin1': [mode],\n",
    "    'x': ['mean', 'max', 'min', 'std', np.ptp, start, end],\n",
    "    'y': ['mean', 'max', 'min', 'std', np.ptp, start, end],\n",
    "    'v': ['mean', 'max', 'min', 'std', np.ptp], 'dir': ['mean'],\n",
    "    'x_bin1_count': ['mean'], 'y_bin1_count': ['mean', 'max', 'min'],\n",
    "    'x_bin2_count': ['mean', 'max', 'min'], 'y_bin2_count': ['mean', 'max', 'min'],\n",
    "    'x_bin1_y_bin1_count': ['mean', 'max', 'min'],\n",
    "    'dist_move_prev': ['mean', 'max', 'std', 'min', 'sum'],\n",
    "    'x_y_min': ['mean', 'min'], 'y_x_min': ['mean', 'min'],\n",
    "    'x_y_max': ['mean', 'min'], 'y_x_max': ['mean', 'min'],\n",
    "}).reset_index()\n",
    "g.columns = ['_'.join(col).strip() for col in g.columns] #提取列名\n",
    "g.rename(columns={'id_': 'id'}, inplace=True) #重命名id_\n",
    "cols = [f for f in g.keys() if f != 'id'] #特征列名提取"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:08.666832Z",
     "start_time": "2021-04-06T09:41:08.616927Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>dist_move_prev_bin_sen</th>\n",
       "      <th>v_bin_sen</th>\n",
       "      <th>id_count</th>\n",
       "      <th>x_bin1_mode</th>\n",
       "      <th>y_bin1_mode</th>\n",
       "      <th>x_bin2_mode</th>\n",
       "      <th>y_bin2_mode</th>\n",
       "      <th>x_y_bin1_mode</th>\n",
       "      <th>x_mean</th>\n",
       "      <th>x_max</th>\n",
       "      <th>...</th>\n",
       "      <th>dist_move_prev_min</th>\n",
       "      <th>dist_move_prev_sum</th>\n",
       "      <th>x_y_min_mean</th>\n",
       "      <th>x_y_min_min</th>\n",
       "      <th>y_x_min_mean</th>\n",
       "      <th>y_x_min_min</th>\n",
       "      <th>x_y_max_mean</th>\n",
       "      <th>x_y_max_min</th>\n",
       "      <th>y_x_max_mean</th>\n",
       "      <th>y_x_max_min</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5....</td>\n",
       "      <td>19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0...</td>\n",
       "      <td>411</td>\n",
       "      <td>145</td>\n",
       "      <td>88</td>\n",
       "      <td>611.0</td>\n",
       "      <td>508.0</td>\n",
       "      <td>252</td>\n",
       "      <td>6.123711e+06</td>\n",
       "      <td>6.151439e+06</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>381420.840554</td>\n",
       "      <td>2458.92664</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4603.814472</td>\n",
       "      <td>0.0</td>\n",
       "      <td>-5075.500661</td>\n",
       "      <td>-57432.286364</td>\n",
       "      <td>-3493.862248</td>\n",
       "      <td>-32066.348374</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5....</td>\n",
       "      <td>19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0...</td>\n",
       "      <td>411</td>\n",
       "      <td>145</td>\n",
       "      <td>88</td>\n",
       "      <td>611.0</td>\n",
       "      <td>508.0</td>\n",
       "      <td>252</td>\n",
       "      <td>6.123711e+06</td>\n",
       "      <td>6.151439e+06</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>381420.840554</td>\n",
       "      <td>2458.92664</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4603.814472</td>\n",
       "      <td>0.0</td>\n",
       "      <td>-5075.500661</td>\n",
       "      <td>-57432.286364</td>\n",
       "      <td>-3493.862248</td>\n",
       "      <td>-32066.348374</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5....</td>\n",
       "      <td>19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0...</td>\n",
       "      <td>411</td>\n",
       "      <td>145</td>\n",
       "      <td>88</td>\n",
       "      <td>611.0</td>\n",
       "      <td>508.0</td>\n",
       "      <td>252</td>\n",
       "      <td>6.123711e+06</td>\n",
       "      <td>6.151439e+06</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>381420.840554</td>\n",
       "      <td>2458.92664</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4603.814472</td>\n",
       "      <td>0.0</td>\n",
       "      <td>-5075.500661</td>\n",
       "      <td>-57432.286364</td>\n",
       "      <td>-3493.862248</td>\n",
       "      <td>-32066.348374</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5....</td>\n",
       "      <td>19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0...</td>\n",
       "      <td>411</td>\n",
       "      <td>145</td>\n",
       "      <td>88</td>\n",
       "      <td>611.0</td>\n",
       "      <td>508.0</td>\n",
       "      <td>252</td>\n",
       "      <td>6.123711e+06</td>\n",
       "      <td>6.151439e+06</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>381420.840554</td>\n",
       "      <td>2458.92664</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4603.814472</td>\n",
       "      <td>0.0</td>\n",
       "      <td>-5075.500661</td>\n",
       "      <td>-57432.286364</td>\n",
       "      <td>-3493.862248</td>\n",
       "      <td>-32066.348374</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5....</td>\n",
       "      <td>19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0...</td>\n",
       "      <td>411</td>\n",
       "      <td>145</td>\n",
       "      <td>88</td>\n",
       "      <td>611.0</td>\n",
       "      <td>508.0</td>\n",
       "      <td>252</td>\n",
       "      <td>6.123711e+06</td>\n",
       "      <td>6.151439e+06</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>381420.840554</td>\n",
       "      <td>2458.92664</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4603.814472</td>\n",
       "      <td>0.0</td>\n",
       "      <td>-5075.500661</td>\n",
       "      <td>-57432.286364</td>\n",
       "      <td>-3493.862248</td>\n",
       "      <td>-32066.348374</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 54 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                              dist_move_prev_bin_sen  \\\n",
       "0  nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5....   \n",
       "1  nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5....   \n",
       "2  nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5....   \n",
       "3  nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5....   \n",
       "4  nan,1.0,1.0,2.0,3.0,4.0,2.0,5.0,3.0,5.0,5.0,5....   \n",
       "\n",
       "                                           v_bin_sen  id_count  x_bin1_mode  \\\n",
       "0  19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0...       411          145   \n",
       "1  19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0...       411          145   \n",
       "2  19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0...       411          145   \n",
       "3  19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0...       411          145   \n",
       "4  19.0,26.0,19.0,2.0,16.0,0.0,30.0,19.0,19.0,0.0...       411          145   \n",
       "\n",
       "   y_bin1_mode  x_bin2_mode  y_bin2_mode  x_y_bin1_mode        x_mean  \\\n",
       "0           88        611.0        508.0            252  6.123711e+06   \n",
       "1           88        611.0        508.0            252  6.123711e+06   \n",
       "2           88        611.0        508.0            252  6.123711e+06   \n",
       "3           88        611.0        508.0            252  6.123711e+06   \n",
       "4           88        611.0        508.0            252  6.123711e+06   \n",
       "\n",
       "          x_max  ...  dist_move_prev_min  dist_move_prev_sum  x_y_min_mean  \\\n",
       "0  6.151439e+06  ...                 0.0       381420.840554    2458.92664   \n",
       "1  6.151439e+06  ...                 0.0       381420.840554    2458.92664   \n",
       "2  6.151439e+06  ...                 0.0       381420.840554    2458.92664   \n",
       "3  6.151439e+06  ...                 0.0       381420.840554    2458.92664   \n",
       "4  6.151439e+06  ...                 0.0       381420.840554    2458.92664   \n",
       "\n",
       "   x_y_min_min  y_x_min_mean  y_x_min_min  x_y_max_mean   x_y_max_min  \\\n",
       "0          0.0   4603.814472          0.0  -5075.500661 -57432.286364   \n",
       "1          0.0   4603.814472          0.0  -5075.500661 -57432.286364   \n",
       "2          0.0   4603.814472          0.0  -5075.500661 -57432.286364   \n",
       "3          0.0   4603.814472          0.0  -5075.500661 -57432.286364   \n",
       "4          0.0   4603.814472          0.0  -5075.500661 -57432.286364   \n",
       "\n",
       "   y_x_max_mean   y_x_max_min  \n",
       "0  -3493.862248 -32066.348374  \n",
       "1  -3493.862248 -32066.348374  \n",
       "2  -3493.862248 -32066.348374  \n",
       "3  -3493.862248 -32066.348374  \n",
       "4  -3493.862248 -32066.348374  \n",
       "\n",
       "[5 rows x 54 columns]"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = df.merge(g,on='id',how='left')\n",
    "\n",
    "new_cols = [i for i in df.columns if i not in pre_cols]\n",
    "df[new_cols].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 划分数据后进行统计"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:09.726927Z",
     "start_time": "2021-04-06T09:41:09.702958Z"
    }
   },
   "outputs": [],
   "source": [
    "def group_feature(df, key, target, aggs,flag):   \n",
    "    \"\"\"通过字典的形式来构建方法和重命名\"\"\"\n",
    "    agg_dict = {}\n",
    "    for ag in aggs:\n",
    "        agg_dict['{}_{}_{}'.format(target,ag,flag)] = ag\n",
    "#     print(agg_dict)\n",
    "    t = df.groupby(key)[target].agg(agg_dict).reset_index()\n",
    "    return t\n",
    "\n",
    "def extract_feature(df, train, flag):\n",
    "    '''\n",
    "    统计feature\n",
    "    注意理解group_feature的使用和效果\n",
    "    '''\n",
    "    if (flag == 'on_night') or (flag == 'on_day'): \n",
    "        t = group_feature(df, 'ship','speed',['max','mean','median','std','skew'],flag)\n",
    "        train = pd.merge(train, t, on='ship', how='left')\n",
    "        # return train\n",
    "    \n",
    "    \n",
    "    if flag == \"0\":\n",
    "        t = group_feature(df, 'ship','direction',['max','median','mean','std','skew'],flag)\n",
    "        train = pd.merge(train, t, on='ship', how='left')  \n",
    "    elif flag == \"1\":\n",
    "        t = group_feature(df, 'ship','speed',['max','mean','median','std','skew'],flag)\n",
    "        train = pd.merge(train, t, on='ship', how='left')\n",
    "        t = group_feature(df, 'ship','direction',['max','median','mean','std','skew'],flag)\n",
    "        train = pd.merge(train, t, on='ship', how='left') \n",
    "        # .nunique().to_dict() 将nunique得到的对应唯一值统计量做成字典\n",
    "        # to_dict() 与 map的使用可以很方便地构建一些统计量映射特征，如CTR（分类）问题中的转化率\n",
    "        # 提问： 如果根据训练集给定的label(0,1)来构建训练集+测试集的转化率特征，注：测试集与训练集存在部分id相同\n",
    "        hour_nunique = df.groupby('ship')['speed'].nunique().to_dict()\n",
    "        train['speed_nunique_{}'.format(flag)] = train['ship'].map(hour_nunique)   \n",
    "        hour_nunique = df.groupby('ship')['direction'].nunique().to_dict()\n",
    "        train['direction_nunique_{}'.format(flag)] = train['ship'].map(hour_nunique)  \n",
    "\n",
    "    t = group_feature(df, 'ship','x',['max','min','mean','median','std','skew'],flag)\n",
    "    train = pd.merge(train, t, on='ship', how='left')\n",
    "    t = group_feature(df, 'ship','y',['max','min','mean','median','std','skew'],flag)\n",
    "    train = pd.merge(train, t, on='ship', how='left')\n",
    "    t = group_feature(df, 'ship','base_dis_diff',['max','min','mean','std','skew'],flag)\n",
    "    train = pd.merge(train, t, on='ship', how='left')\n",
    "\n",
    "       \n",
    "    train['x_max_x_min_{}'.format(flag)] = train['x_max_{}'.format(flag)] - train['x_min_{}'.format(flag)]\n",
    "    train['y_max_y_min_{}'.format(flag)] = train['y_max_{}'.format(flag)] - train['y_min_{}'.format(flag)]\n",
    "    train['y_max_x_min_{}'.format(flag)] = train['y_max_{}'.format(flag)] - train['x_min_{}'.format(flag)]\n",
    "    train['x_max_y_min_{}'.format(flag)] = train['x_max_{}'.format(flag)] - train['y_min_{}'.format(flag)]\n",
    "    train['slope_{}'.format(flag)] = train['y_max_y_min_{}'.format(flag)] / np.where(train['x_max_x_min_{}'.format(flag)]==0, 0.001, train['x_max_x_min_{}'.format(flag)])\n",
    "    train['area_{}'.format(flag)] = train['x_max_x_min_{}'.format(flag)] * train['y_max_y_min_{}'.format(flag)] \n",
    "    \n",
    "    mode_hour = df.groupby('ship')['hour'].agg(lambda x:x.value_counts().index[0]).to_dict()\n",
    "    train['mode_hour_{}'.format(flag)] = train['ship'].map(mode_hour)\n",
    "    train['slope_median_{}'.format(flag)] = train['y_median_{}'.format(flag)] / np.where(train['x_median_{}'.format(flag)]==0, 0.001, train['x_median_{}'.format(flag)])\n",
    "\n",
    "    return train"
   ]
  },
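  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two patterns used in `extract_feature` can be tried in isolation: named aggregation with generated column names, and `to_dict()` + `map` to attach a per-ship statistic. This is a minimal sketch, not the notebook's pipeline: the toy frame and the helper name `named_group_stats` are invented for illustration, and the keyword form of `agg` assumes pandas >= 0.25.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy data: two ships with a few speed readings each (invented values).\n",
    "toy = pd.DataFrame({\n",
    "    'ship': [1, 1, 1, 2, 2],\n",
    "    'speed': [0.0, 2.0, 4.0, 1.0, 1.0],\n",
    "})\n",
    "\n",
    "def named_group_stats(df, key, target, aggs, flag):\n",
    "    # Build {'<target>_<agg>_<flag>': agg} and pass it as keyword arguments.\n",
    "    names = {'{}_{}_{}'.format(target, ag, flag): ag for ag in aggs}\n",
    "    return df.groupby(key)[target].agg(**names).reset_index()\n",
    "\n",
    "stats = named_group_stats(toy, 'ship', 'speed', ['max', 'mean'], 'demo')\n",
    "\n",
    "# to_dict() + map: per-ship count of distinct speeds, mapped back by ship id.\n",
    "speed_nunique = toy.groupby('ship')['speed'].nunique().to_dict()\n",
    "stats['speed_nunique_demo'] = stats['ship'].map(speed_nunique)\n",
    "```\n",
    "\n",
    "`stats` ends up with one row per ship and the columns `ship`, `speed_max_demo`, `speed_mean_demo`, and `speed_nunique_demo`."
   ]
  },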
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:11.295988Z",
     "start_time": "2021-04-06T09:41:10.995520Z"
    }
   },
   "outputs": [],
   "source": [
    "data  = df.copy()\n",
    "data.rename(columns={\n",
    "    'id':'ship',\n",
    "    'v':'speed',\n",
    "    'dir':'direction'\n",
    "},inplace=True)\n",
    "# Deduplicate: keep one row per ship\n",
    "data_label = data.drop_duplicates(['ship'],keep = 'first')\n",
    "\n",
    "data_1 = data[data['speed']==0]\n",
    "data_2 = data[data['speed']!=0]\n",
    "data_label = extract_feature(data_1, data_label,\"0\")\n",
    "data_label = extract_feature(data_2, data_label,\"1\")\n",
    "\n",
    "data_1 = data[data['day_nig'] == 0]\n",
    "data_2 = data[data['day_nig'] == 1]\n",
    "data_label = extract_feature(data_1, data_label,\"on_night\")\n",
    "data_label = extract_feature(data_2, data_label,\"on_day\")\n",
    "data_label.rename(columns={'ship':'id','speed':'v','direction':'dir'},inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:11.527562Z",
     "start_time": "2021-04-06T09:41:11.473706Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>direction_max_0</th>\n",
       "      <th>direction_median_0</th>\n",
       "      <th>direction_mean_0</th>\n",
       "      <th>direction_std_0</th>\n",
       "      <th>direction_skew_0</th>\n",
       "      <th>x_max_0</th>\n",
       "      <th>x_min_0</th>\n",
       "      <th>x_mean_0</th>\n",
       "      <th>x_median_0</th>\n",
       "      <th>x_std_0</th>\n",
       "      <th>...</th>\n",
       "      <th>base_dis_diff_std_on_day</th>\n",
       "      <th>base_dis_diff_skew_on_day</th>\n",
       "      <th>x_max_x_min_on_day</th>\n",
       "      <th>y_max_y_min_on_day</th>\n",
       "      <th>y_max_x_min_on_day</th>\n",
       "      <th>x_max_y_min_on_day</th>\n",
       "      <th>slope_on_day</th>\n",
       "      <th>area_on_day</th>\n",
       "      <th>mode_hour_on_day</th>\n",
       "      <th>slope_median_on_day</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>9650.263589</td>\n",
       "      <td>-0.389598</td>\n",
       "      <td>45396.666092</td>\n",
       "      <td>43135.705758</td>\n",
       "      <td>-989573.982047</td>\n",
       "      <td>1.078106e+06</td>\n",
       "      <td>0.950195</td>\n",
       "      <td>1.958217e+09</td>\n",
       "      <td>19</td>\n",
       "      <td>0.831333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>9650.263589</td>\n",
       "      <td>-0.389598</td>\n",
       "      <td>45396.666092</td>\n",
       "      <td>43135.705758</td>\n",
       "      <td>-989573.982047</td>\n",
       "      <td>1.078106e+06</td>\n",
       "      <td>0.950195</td>\n",
       "      <td>1.958217e+09</td>\n",
       "      <td>19</td>\n",
       "      <td>0.831333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>9650.263589</td>\n",
       "      <td>-0.389598</td>\n",
       "      <td>45396.666092</td>\n",
       "      <td>43135.705758</td>\n",
       "      <td>-989573.982047</td>\n",
       "      <td>1.078106e+06</td>\n",
       "      <td>0.950195</td>\n",
       "      <td>1.958217e+09</td>\n",
       "      <td>19</td>\n",
       "      <td>0.831333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>9650.263589</td>\n",
       "      <td>-0.389598</td>\n",
       "      <td>45396.666092</td>\n",
       "      <td>43135.705758</td>\n",
       "      <td>-989573.982047</td>\n",
       "      <td>1.078106e+06</td>\n",
       "      <td>0.950195</td>\n",
       "      <td>1.958217e+09</td>\n",
       "      <td>19</td>\n",
       "      <td>0.831333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>6.102751e+06</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>9650.263589</td>\n",
       "      <td>-0.389598</td>\n",
       "      <td>45396.666092</td>\n",
       "      <td>43135.705758</td>\n",
       "      <td>-989573.982047</td>\n",
       "      <td>1.078106e+06</td>\n",
       "      <td>0.950195</td>\n",
       "      <td>1.958217e+09</td>\n",
       "      <td>19</td>\n",
       "      <td>0.831333</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 127 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   direction_max_0  direction_median_0  direction_mean_0  direction_std_0  \\\n",
       "0                0                 0.0               0.0              0.0   \n",
       "1                0                 0.0               0.0              0.0   \n",
       "2                0                 0.0               0.0              0.0   \n",
       "3                0                 0.0               0.0              0.0   \n",
       "4                0                 0.0               0.0              0.0   \n",
       "\n",
       "   direction_skew_0       x_max_0       x_min_0      x_mean_0    x_median_0  \\\n",
       "0               0.0  6.102751e+06  6.102751e+06  6.102751e+06  6.102751e+06   \n",
       "1               0.0  6.102751e+06  6.102751e+06  6.102751e+06  6.102751e+06   \n",
       "2               0.0  6.102751e+06  6.102751e+06  6.102751e+06  6.102751e+06   \n",
       "3               0.0  6.102751e+06  6.102751e+06  6.102751e+06  6.102751e+06   \n",
       "4               0.0  6.102751e+06  6.102751e+06  6.102751e+06  6.102751e+06   \n",
       "\n",
       "   x_std_0  ...  base_dis_diff_std_on_day  base_dis_diff_skew_on_day  \\\n",
       "0      0.0  ...               9650.263589                  -0.389598   \n",
       "1      0.0  ...               9650.263589                  -0.389598   \n",
       "2      0.0  ...               9650.263589                  -0.389598   \n",
       "3      0.0  ...               9650.263589                  -0.389598   \n",
       "4      0.0  ...               9650.263589                  -0.389598   \n",
       "\n",
       "   x_max_x_min_on_day  y_max_y_min_on_day  y_max_x_min_on_day  \\\n",
       "0        45396.666092        43135.705758      -989573.982047   \n",
       "1        45396.666092        43135.705758      -989573.982047   \n",
       "2        45396.666092        43135.705758      -989573.982047   \n",
       "3        45396.666092        43135.705758      -989573.982047   \n",
       "4        45396.666092        43135.705758      -989573.982047   \n",
       "\n",
       "   x_max_y_min_on_day  slope_on_day   area_on_day  mode_hour_on_day  \\\n",
       "0        1.078106e+06      0.950195  1.958217e+09                19   \n",
       "1        1.078106e+06      0.950195  1.958217e+09                19   \n",
       "2        1.078106e+06      0.950195  1.958217e+09                19   \n",
       "3        1.078106e+06      0.950195  1.958217e+09                19   \n",
       "4        1.078106e+06      0.950195  1.958217e+09                19   \n",
       "\n",
       "   slope_median_on_day  \n",
       "0             0.831333  \n",
       "1             0.831333  \n",
       "2             0.831333  \n",
       "3             0.831333  \n",
       "4             0.831333  \n",
       "\n",
       "[5 rows x 127 columns]"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_cols = [i for i in data_label.columns if i not in df.columns]\n",
    "df = df.merge(data_label[new_cols+['id']],on='id',how='left')\n",
    "\n",
    "df[new_cols].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Statistical features in practice"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:13.059297Z",
     "start_time": "2021-04-06T09:41:12.464664Z"
    }
   },
   "outputs": [],
   "source": [
    "temp = df.copy()\n",
    "temp.rename(columns={'id':'ship','dir':'d'},inplace=True)\n",
    "\n",
    "def coefficient_of_variation(x):\n",
    "    x = x.values\n",
    "    if np.mean(x) == 0:\n",
    "        return 0\n",
    "    return np.std(x) / np.mean(x)\n",
    "\n",
    "def max_2(x):\n",
    "    # second-largest value; NaN if the group has fewer than 2 rows\n",
    "    x = sorted(x.values, reverse=True)\n",
    "    return x[1] if len(x) > 1 else np.nan\n",
    "\n",
    "def max_3(x):\n",
    "    # third-largest value; NaN if the group has fewer than 3 rows\n",
    "    x = sorted(x.values, reverse=True)\n",
    "    return x[2] if len(x) > 2 else np.nan\n",
    "\n",
    "def diff_abs_mean(x):  # mean absolute difference between consecutive values\n",
    "    return np.mean(np.abs(np.diff(x)))\n",
    "\n",
    "f1 = pd.DataFrame()\n",
    "for col in ['x', 'y', 'v', 'd']:\n",
    "    # Named aggregation via ** (pandas >= 0.25); a plain dict renamer raises in pandas >= 1.0\n",
    "    features = temp.groupby('ship')[col].agg(**{\n",
    "        '{}_min'.format(col): 'min',\n",
    "        '{}_max'.format(col): 'max',\n",
    "        '{}_mean'.format(col): 'mean',\n",
    "        '{}_median'.format(col): 'median',\n",
    "        '{}_std'.format(col): 'std',\n",
    "        '{}_skew'.format(col): 'skew',\n",
    "        '{}_sum'.format(col): 'sum',\n",
    "        '{}_diff_abs_mean'.format(col): diff_abs_mean,\n",
    "        '{}_mode'.format(col): lambda x: x.value_counts().index[0],\n",
    "        '{}_coefficient_of_variation'.format(col): coefficient_of_variation,\n",
    "        '{}_max2'.format(col): max_2,\n",
    "        '{}_max3'.format(col): max_3\n",
    "    }).reset_index()\n",
    "    if f1.shape[0] == 0:\n",
    "        f1 = features\n",
    "    else:\n",
    "        f1 = f1.merge(features, on='ship', how='left')\n",
    "\n",
    "f1['x_max_x_min'] = f1['x_max'] - f1['x_min']\n",
    "f1['y_max_y_min'] = f1['y_max'] - f1['y_min']\n",
    "f1['y_max_x_min'] = f1['y_max'] - f1['x_min']\n",
    "f1['x_max_y_min'] = f1['x_max'] - f1['y_min']\n",
    "f1['slope'] = f1['y_max_y_min'] / np.where(f1['x_max_x_min'] == 0, 0.001, f1['x_max_x_min'])\n",
    "f1['area'] = f1['x_max_x_min'] * f1['y_max_y_min']\n",
    "f1['dis_max_min'] = (f1['x_max_x_min'] ** 2 + f1['y_max_y_min'] ** 2) ** 0.5\n",
    "f1['dis_mean'] = (f1['x_mean'] ** 2 + f1['y_mean'] ** 2) ** 0.5\n",
    "f1['area_d_dis_max_min'] = f1['area'] / f1['dis_max_min']\n",
    "\n",
    "# Rate of change of position along y and x between consecutive points.\n",
    "# The columns are named a_y / a_x, but (next - current) / dt is a velocity component.\n",
    "temp.sort_values(['ship', 'time'], ascending=True, inplace=True)\n",
    "temp['ynext'] = temp.groupby('ship')['y'].shift(-1)\n",
    "temp['xnext'] = temp.groupby('ship')['x'].shift(-1)\n",
    "temp['ynext'] = temp['ynext'].ffill()\n",
    "temp['xnext'] = temp['xnext'].ffill()\n",
    "temp['timenext'] = temp.groupby('ship')['time'].shift(-1)\n",
    "temp['timediff'] = np.abs(temp['timenext'] - temp['time'])\n",
    "# Vectorized division; the last point of each ship has no next timestamp and yields NaN\n",
    "temp['a_y'] = (temp['ynext'] - temp['y']) / temp['timediff'].dt.total_seconds()\n",
    "temp['a_x'] = (temp['xnext'] - temp['x']) / temp['timediff'].dt.total_seconds()\n",
    "for col in ['a_y', 'a_x']:\n",
    "    f2 = temp.groupby('ship')[col].agg(**{\n",
    "        '{}_max'.format(col): 'max',\n",
    "        '{}_mean'.format(col): 'mean',\n",
    "        '{}_min'.format(col): 'min',\n",
    "        '{}_median'.format(col): 'median',\n",
    "        '{}_std'.format(col): 'std'}).reset_index()\n",
    "    f1 = f1.merge(f2, on='ship', how='left')\n",
    "\n",
    "# Curvature proxy: path length through the point divided by the chord from previous to next point\n",
    "temp['y_pre'] = temp.groupby('ship')['y'].shift(1)\n",
    "temp['x_pre'] = temp.groupby('ship')['x'].shift(1)\n",
    "temp['y_pre'] = temp['y_pre'].bfill()\n",
    "temp['x_pre'] = temp['x_pre'].bfill()\n",
    "temp['d_pre'] = ((temp['x'] - temp['x_pre']) ** 2 + (temp['y'] - temp['y_pre']) ** 2) ** 0.5\n",
    "temp['d_next'] = ((temp['xnext'] - temp['x']) ** 2 + (temp['ynext'] - temp['y']) ** 2) ** 0.5\n",
    "temp['d_pre_next'] = ((temp['xnext'] - temp['x_pre']) ** 2 + (temp['ynext'] - temp['y_pre']) ** 2) ** 0.5\n",
    "temp['curvature'] = (temp['d_pre'] + temp['d_next']) / temp['d_pre_next']\n",
    "\n",
    "f2 = temp.groupby('ship')['curvature'].agg(**{\n",
    "    'curvature_max': 'max',\n",
    "    'curvature_mean': 'mean',\n",
    "    'curvature_min': 'min',\n",
    "    'curvature_median': 'median',\n",
    "    'curvature_std': 'std'}).reset_index()\n",
    "f1 = f1.merge(f2, on='ship', how='left')"
   ]
  },
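  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `curvature` column computed above is the ratio of the two-segment path length (previous point to current point to next point) to the straight chord from the previous point to the next one. It equals 1 on a straight line and grows as the turn gets sharper. A small sketch with invented points; the function name `turn_ratio` is hypothetical and not part of the notebook.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def turn_ratio(prev_pt, pt, next_pt):\n",
    "    # (|prev->pt| + |pt->next|) / |prev->next|, the same ratio as `curvature`.\n",
    "    prev_pt, pt, next_pt = map(np.asarray, (prev_pt, pt, next_pt))\n",
    "    d_pre = np.linalg.norm(pt - prev_pt)\n",
    "    d_next = np.linalg.norm(next_pt - pt)\n",
    "    d_pre_next = np.linalg.norm(next_pt - prev_pt)\n",
    "    return (d_pre + d_next) / d_pre_next\n",
    "\n",
    "straight = turn_ratio((0, 0), (1, 0), (2, 0))     # collinear points\n",
    "right_angle = turn_ratio((0, 0), (1, 0), (1, 1))  # 90-degree turn\n",
    "```\n",
    "\n",
    "Here `straight` is 1 and `right_angle` is the square root of 2. When the previous and next points coincide (for example a stationary ship), the chord length is 0 and the ratio is undefined, so NaN or inf values in `curvature` are to be expected before aggregation."
   ]
  },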
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Embedding features"
   ]
  },
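  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- Question\n",
    "\n",
    "Why do we use word2vec or NMF (there are many methods, but these two are the common ones) to build “word-embedding features” in data mining competitions?\n",
    "\n",
    "Answer: to climb the leaderboard!\n",
    "\n",
    "A better score is the visible effect, but the reason behind it is that embeddings look at the data as a whole. The statistical and domain features above also summarize the data as a whole, yet they tend to miss the relationships between records. Take people's ages as an example: statistical features such as the mean and the extremes, or domain features such as standard weight = weight / age, all come from human understanding. If we instead treat each feature value as a word, and treat the words from all the data (or from one group of the data) as a document, we may be able to learn additional patterns, that is, additional features.\n",
    "\n",
    "- Overview\n",
    "\n",
    "A word embedding encodes a word as a vector so that it can be fed into a network. Word embeddings are sometimes also called distributional semantic models or vector space models. As the names suggest, the technique places words of the same type close together: apple, mango, and banana end up near each other in the projected vector space, while book and house lie farther away from them.\n",
    "\n",
    "- Use cases\n",
    "\n",
    "Word embeddings are used for feature generation, document clustering, text classification, and other natural language processing tasks. For example:\n",
    "\n",
    "Finding similar words: word embeddings can be used to find the words closest to a given word.\n",
    "\n",
    "Building groups of related words: cluster the words so that related ones end up together.\n",
    "\n",
    "Features for text classification: words cannot be fed to a machine learning model directly, so we first project them into a vector space and then train the model on those vectors.\n",
    "\n",
    "Document clustering.\n",
    "\n",
    "The tasks above are all text-related, but embedding models have since been extended well beyond text. Typical examples:\n",
    "\n",
    "On Weibo, represent each user as a word, build an embedding for each user, and compute the similarity between users to find the most closely related ones.\n",
    "\n",
    "In recommendation, embed each item based on users' purchase records, then compute item-to-item similarity and recommend accordingly.\n",
    "\n",
    "In this Tianchi maritime problem, embedding the different ships that appear at the same latitude and longitude gives a vector for each ship, which in turn shows which ships frequently work in particular areas.\n",
    "\n",
    "In short, embeddings are a great help in finding relationships between entities, and they now show up in almost every data competition. Let's look at the most widely used model, Word2Vec.\n",
    "\n",
    "- What does Word2Vec do?\n",
    "\n",
    "Word2vec represents words as vectors. In the resulting vector space, words with similar meanings lie close together and dissimilar words lie far apart. This is also called a semantic relationship.\n",
    "\n",
    "Neural networks do not understand text, only numbers. Word embeddings provide a way to turn text into numeric vectors.\n",
    "\n",
    "Word2vec reconstructs the linguistic context of words. What is linguistic context? In everyday life, when we communicate by speaking or writing, other people try to work out the purpose of a sentence. For example, in “What is the temperature in India?”, the context is that the user wants to know the temperature in India.\n",
    "\n",
    "In short, the main purpose of a sentence is its context. The words or sentences that surround a piece of spoken or written language (the discourse) help determine its meaning in context. Word2vec learns vector representations of words from their contexts.\n",
    "\n",
    "- Reference\n",
    "\n",
    "[NLP] 秒懂词向量Word2vec的本质 (Understanding the essence of Word2vec word vectors in seconds): https://zhuanlan.zhihu.com/p/26306795"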
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Building word vectors with Word2vec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:13.778719Z",
     "start_time": "2021-04-06T09:41:13.764759Z"
    }
   },
   "outputs": [],
   "source": [
    "def traj_cbow_embedding(traj_data_corpus=None, embedding_size=70,\n",
    "                        iters=40, min_count=3, window_size=25,\n",
    "                        seed=9012, num_runs=5, word_feat=\"no_bin\"):\n",
    "    \"\"\"CBOW embedding for trajectory data.\"\"\"\n",
    "    boat_id = traj_data_corpus['id'].unique()\n",
    "    sentences, embedding_df_list, embedding_model_list = [], [], []\n",
    "    for i in boat_id:\n",
    "        traj = traj_data_corpus[traj_data_corpus['id']==i]\n",
    "        sentences.append(traj[word_feat].values.tolist())\n",
    "\n",
    "    print(\"\\n@Start CBOW word embedding at {}\".format(datetime.now()))\n",
    "    print(\"-------------------------------------------\")\n",
    "    for i in tqdm(range(num_runs)):\n",
    "        # `size` and `iter` are the gensim 3.x names; gensim >= 4 calls them `vector_size` and `epochs`\n",
    "        model = Word2Vec(sentences, size=embedding_size,\n",
    "                                  min_count=min_count,\n",
    "                                  workers=mp.cpu_count(),\n",
    "                                  window=window_size,\n",
    "                                  seed=seed, iter=iters, sg=0)\n",
    "\n",
    "        # Sentence vector: mean of the in-vocabulary word vectors\n",
    "        embedding_vec = []\n",
    "        for ind, seq in enumerate(sentences):\n",
    "            seq_vec, word_count = 0, 0\n",
    "            for word in seq:\n",
    "                # `model.wv` holds the trained vectors in both gensim 3.x and 4.x\n",
    "                if word not in model.wv:\n",
    "                    continue\n",
    "                else:\n",
    "                    seq_vec += model.wv[word]\n",
    "                    word_count += 1\n",
    "            if word_count == 0:\n",
    "                embedding_vec.append(embedding_size * [0])\n",
    "            else:\n",
    "                embedding_vec.append(seq_vec / word_count)\n",
    "        embedding_vec = np.array(embedding_vec)\n",
    "        embedding_cbow_df = pd.DataFrame(embedding_vec, \n",
    "            columns=[\"embedding_cbow_{}_{}\".format(word_feat, i) for i in range(embedding_size)])\n",
    "        embedding_cbow_df[\"id\"] = boat_id\n",
    "        embedding_df_list.append(embedding_cbow_df)\n",
    "        embedding_model_list.append(model)\n",
    "    print(\"-------------------------------------------\")\n",
    "    print(\"@End CBOW word embedding at {}\".format(datetime.now()))\n",
    "    return embedding_df_list, embedding_model_list"
   ]
  },
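  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The sentence-vector step in `traj_cbow_embedding` is mean pooling: each ship's sequence of tokens becomes the average of the vectors of the tokens that are in the vocabulary, and an all-zero vector when none of them are. The sketch below isolates just that step. A hand-written lookup table stands in for a trained Word2Vec model; the tokens, the vectors, and the name `mean_pool` are invented for illustration.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Stand-in for trained word vectors: token -> 2-d vector (invented values).\n",
    "toy_vectors = {\n",
    "    'a': np.array([1.0, 0.0]),\n",
    "    'b': np.array([0.0, 1.0]),\n",
    "}\n",
    "\n",
    "def mean_pool(tokens, vectors, dim):\n",
    "    # Average the vectors of known tokens; unknown tokens are skipped.\n",
    "    known = [vectors[t] for t in tokens if t in vectors]\n",
    "    if not known:\n",
    "        return np.zeros(dim)\n",
    "    return np.mean(known, axis=0)\n",
    "\n",
    "sentences = [['a', 'b', 'a', 'rare'], ['rare']]\n",
    "pooled = np.array([mean_pool(s, toy_vectors, dim=2) for s in sentences])\n",
    "```\n",
    "\n",
    "The first row of `pooled` is the average of `a`, `b`, `a` (the out-of-vocabulary token `rare` is ignored), and the second row is all zeros, which mirrors what happens to tokens filtered out by `min_count`."
   ]
  },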
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:14.390155Z",
     "start_time": "2021-04-06T09:41:14.128633Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "  0%|                                                                                            | 0/1 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "@Start CBOW word embedding at 2021-04-06 17:41:14.143589\n",
      "-------------------------------------------\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.39it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-------------------------------------------\n",
      "@End CBOW word embedding at 2021-04-06 17:41:14.373201\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "embedding_size=70\n",
    "iters=70\n",
    "min_count=3\n",
    "window_size=25\n",
    "num_runs=1\n",
    "\n",
    "df_list, model_list = traj_cbow_embedding(df,\n",
    "                                          embedding_size=embedding_size,\n",
    "                                          iters=iters, min_count=min_count,\n",
    "                                          window_size=window_size,\n",
    "                                          seed=9012,\n",
    "                                          num_runs=num_runs,\n",
    "                                          word_feat=\"no_bin\")\n",
    "\n",
    "train_embedding_df_list = [d.reset_index(drop=True) for d in df_list]\n",
    "fea = train_embedding_df_list[0]\n",
    "fea = pd.DataFrame(fea)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:14.637401Z",
     "start_time": "2021-04-06T09:41:14.561603Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>embedding_cbow_no_bin_0</th>\n",
       "      <th>embedding_cbow_no_bin_1</th>\n",
       "      <th>embedding_cbow_no_bin_2</th>\n",
       "      <th>embedding_cbow_no_bin_3</th>\n",
       "      <th>embedding_cbow_no_bin_4</th>\n",
       "      <th>embedding_cbow_no_bin_5</th>\n",
       "      <th>embedding_cbow_no_bin_6</th>\n",
       "      <th>embedding_cbow_no_bin_7</th>\n",
       "      <th>embedding_cbow_no_bin_8</th>\n",
       "      <th>embedding_cbow_no_bin_9</th>\n",
       "      <th>...</th>\n",
       "      <th>embedding_cbow_no_bin_60</th>\n",
       "      <th>embedding_cbow_no_bin_61</th>\n",
       "      <th>embedding_cbow_no_bin_62</th>\n",
       "      <th>embedding_cbow_no_bin_63</th>\n",
       "      <th>embedding_cbow_no_bin_64</th>\n",
       "      <th>embedding_cbow_no_bin_65</th>\n",
       "      <th>embedding_cbow_no_bin_66</th>\n",
       "      <th>embedding_cbow_no_bin_67</th>\n",
       "      <th>embedding_cbow_no_bin_68</th>\n",
       "      <th>embedding_cbow_no_bin_69</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.113876</td>\n",
       "      <td>0.915507</td>\n",
       "      <td>0.748654</td>\n",
       "      <td>-0.474716</td>\n",
       "      <td>0.025936</td>\n",
       "      <td>0.891744</td>\n",
       "      <td>0.404129</td>\n",
       "      <td>-0.73345</td>\n",
       "      <td>0.664501</td>\n",
       "      <td>0.025082</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.460846</td>\n",
       "      <td>0.096531</td>\n",
       "      <td>0.106979</td>\n",
       "      <td>0.869454</td>\n",
       "      <td>-0.492184</td>\n",
       "      <td>0.166157</td>\n",
       "      <td>-0.280037</td>\n",
       "      <td>-0.351043</td>\n",
       "      <td>-0.832541</td>\n",
       "      <td>-0.139282</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.113876</td>\n",
       "      <td>0.915507</td>\n",
       "      <td>0.748654</td>\n",
       "      <td>-0.474716</td>\n",
       "      <td>0.025936</td>\n",
       "      <td>0.891744</td>\n",
       "      <td>0.404129</td>\n",
       "      <td>-0.73345</td>\n",
       "      <td>0.664501</td>\n",
       "      <td>0.025082</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.460846</td>\n",
       "      <td>0.096531</td>\n",
       "      <td>0.106979</td>\n",
       "      <td>0.869454</td>\n",
       "      <td>-0.492184</td>\n",
       "      <td>0.166157</td>\n",
       "      <td>-0.280037</td>\n",
       "      <td>-0.351043</td>\n",
       "      <td>-0.832541</td>\n",
       "      <td>-0.139282</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.113876</td>\n",
       "      <td>0.915507</td>\n",
       "      <td>0.748654</td>\n",
       "      <td>-0.474716</td>\n",
       "      <td>0.025936</td>\n",
       "      <td>0.891744</td>\n",
       "      <td>0.404129</td>\n",
       "      <td>-0.73345</td>\n",
       "      <td>0.664501</td>\n",
       "      <td>0.025082</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.460846</td>\n",
       "      <td>0.096531</td>\n",
       "      <td>0.106979</td>\n",
       "      <td>0.869454</td>\n",
       "      <td>-0.492184</td>\n",
       "      <td>0.166157</td>\n",
       "      <td>-0.280037</td>\n",
       "      <td>-0.351043</td>\n",
       "      <td>-0.832541</td>\n",
       "      <td>-0.139282</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.113876</td>\n",
       "      <td>0.915507</td>\n",
       "      <td>0.748654</td>\n",
       "      <td>-0.474716</td>\n",
       "      <td>0.025936</td>\n",
       "      <td>0.891744</td>\n",
       "      <td>0.404129</td>\n",
       "      <td>-0.73345</td>\n",
       "      <td>0.664501</td>\n",
       "      <td>0.025082</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.460846</td>\n",
       "      <td>0.096531</td>\n",
       "      <td>0.106979</td>\n",
       "      <td>0.869454</td>\n",
       "      <td>-0.492184</td>\n",
       "      <td>0.166157</td>\n",
       "      <td>-0.280037</td>\n",
       "      <td>-0.351043</td>\n",
       "      <td>-0.832541</td>\n",
       "      <td>-0.139282</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.113876</td>\n",
       "      <td>0.915507</td>\n",
       "      <td>0.748654</td>\n",
       "      <td>-0.474716</td>\n",
       "      <td>0.025936</td>\n",
       "      <td>0.891744</td>\n",
       "      <td>0.404129</td>\n",
       "      <td>-0.73345</td>\n",
       "      <td>0.664501</td>\n",
       "      <td>0.025082</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.460846</td>\n",
       "      <td>0.096531</td>\n",
       "      <td>0.106979</td>\n",
       "      <td>0.869454</td>\n",
       "      <td>-0.492184</td>\n",
       "      <td>0.166157</td>\n",
       "      <td>-0.280037</td>\n",
       "      <td>-0.351043</td>\n",
       "      <td>-0.832541</td>\n",
       "      <td>-0.139282</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 70 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   embedding_cbow_no_bin_0  embedding_cbow_no_bin_1  embedding_cbow_no_bin_2  \\\n",
       "0                 0.113876                 0.915507                 0.748654   \n",
       "1                 0.113876                 0.915507                 0.748654   \n",
       "2                 0.113876                 0.915507                 0.748654   \n",
       "3                 0.113876                 0.915507                 0.748654   \n",
       "4                 0.113876                 0.915507                 0.748654   \n",
       "\n",
       "   embedding_cbow_no_bin_3  embedding_cbow_no_bin_4  embedding_cbow_no_bin_5  \\\n",
       "0                -0.474716                 0.025936                 0.891744   \n",
       "1                -0.474716                 0.025936                 0.891744   \n",
       "2                -0.474716                 0.025936                 0.891744   \n",
       "3                -0.474716                 0.025936                 0.891744   \n",
       "4                -0.474716                 0.025936                 0.891744   \n",
       "\n",
       "   embedding_cbow_no_bin_6  embedding_cbow_no_bin_7  embedding_cbow_no_bin_8  \\\n",
       "0                 0.404129                 -0.73345                 0.664501   \n",
       "1                 0.404129                 -0.73345                 0.664501   \n",
       "2                 0.404129                 -0.73345                 0.664501   \n",
       "3                 0.404129                 -0.73345                 0.664501   \n",
       "4                 0.404129                 -0.73345                 0.664501   \n",
       "\n",
       "   embedding_cbow_no_bin_9  ...  embedding_cbow_no_bin_60  \\\n",
       "0                 0.025082  ...                 -0.460846   \n",
       "1                 0.025082  ...                 -0.460846   \n",
       "2                 0.025082  ...                 -0.460846   \n",
       "3                 0.025082  ...                 -0.460846   \n",
       "4                 0.025082  ...                 -0.460846   \n",
       "\n",
       "   embedding_cbow_no_bin_61  embedding_cbow_no_bin_62  \\\n",
       "0                  0.096531                  0.106979   \n",
       "1                  0.096531                  0.106979   \n",
       "2                  0.096531                  0.106979   \n",
       "3                  0.096531                  0.106979   \n",
       "4                  0.096531                  0.106979   \n",
       "\n",
       "   embedding_cbow_no_bin_63  embedding_cbow_no_bin_64  \\\n",
       "0                  0.869454                 -0.492184   \n",
       "1                  0.869454                 -0.492184   \n",
       "2                  0.869454                 -0.492184   \n",
       "3                  0.869454                 -0.492184   \n",
       "4                  0.869454                 -0.492184   \n",
       "\n",
       "   embedding_cbow_no_bin_65  embedding_cbow_no_bin_66  \\\n",
       "0                  0.166157                 -0.280037   \n",
       "1                  0.166157                 -0.280037   \n",
       "2                  0.166157                 -0.280037   \n",
       "3                  0.166157                 -0.280037   \n",
       "4                  0.166157                 -0.280037   \n",
       "\n",
       "   embedding_cbow_no_bin_67  embedding_cbow_no_bin_68  \\\n",
       "0                 -0.351043                 -0.832541   \n",
       "1                 -0.351043                 -0.832541   \n",
       "2                 -0.351043                 -0.832541   \n",
       "3                 -0.351043                 -0.832541   \n",
       "4                 -0.351043                 -0.832541   \n",
       "\n",
       "   embedding_cbow_no_bin_69  \n",
       "0                 -0.139282  \n",
       "1                 -0.139282  \n",
       "2                 -0.139282  \n",
       "3                 -0.139282  \n",
       "4                 -0.139282  \n",
       "\n",
       "[5 rows x 70 columns]"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pre_cols = df.columns\n",
    "df = df.merge(fea,on='id',how='left')\n",
    "\n",
    "\n",
    "new_cols = [i for i in df.columns if i not in pre_cols]\n",
    "df[new_cols].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:15.479705Z",
     "start_time": "2021-04-06T09:41:15.037950Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  5.47it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "@Round 2 speed embedding:\n",
      "\n",
      "@Start CBOW word embedding at 2021-04-06 17:41:15.054905\n",
      "-------------------------------------------\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  5.44it/s]\n",
      "  0%|                                                                                            | 0/1 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-------------------------------------------\n",
      "@End CBOW word embedding at 2021-04-06 17:41:15.241547\n",
      "\n",
      "@Round 2 direction embedding:\n",
      "\n",
      "@Start CBOW word embedding at 2021-04-06 17:41:15.249564\n",
      "-------------------------------------------\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  4.54it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-------------------------------------------\n",
      "@End CBOW word embedding at 2021-04-06 17:41:15.470688\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "boat_id = df['id'].unique()\n",
    "total_embedding = pd.DataFrame(boat_id, columns=[\"id\"])\n",
    "traj_data = df[['v','dir','id']].rename(columns = {'v':'speed','dir':'direction'})\n",
    "\n",
    "# Step 1: Construct the words (speed and direction values become string tokens)\n",
    "traj_data[\"speed_str\"]     = traj_data[\"speed\"].apply(lambda x: str(int(x*100)))\n",
    "traj_data[\"direction_str\"] = traj_data[\"direction\"].apply(str)\n",
    "traj_data[\"speed_dir_str\"] = traj_data[\"speed_str\"] + \"_\" + traj_data[\"direction_str\"]\n",
    "traj_data_corpus = traj_data[[\"id\", \"speed_str\",\n",
    "                                  \"direction_str\", \"speed_dir_str\"]]\n",
    "print(\"\\n@Round 2 speed embedding:\")\n",
    "df_list, model_list = traj_cbow_embedding(traj_data_corpus,\n",
    "                                          embedding_size=10,\n",
    "                                          iters=40, min_count=3,\n",
    "                                          window_size=25, seed=9102,\n",
    "                                          num_runs=1, word_feat=\"speed_str\")\n",
    "speed_embedding = df_list[0].reset_index(drop=True)\n",
    "total_embedding = pd.merge(total_embedding, speed_embedding,\n",
    "                           on=\"id\", how=\"left\")\n",
    "\n",
    "\n",
    "# Despite the label printed below, this round embeds the combined speed_dir_str tokens\n",
    "print(\"\\n@Round 2 direction embedding:\")\n",
    "df_list, model_list = traj_cbow_embedding(traj_data_corpus,\n",
    "                                          embedding_size=12,\n",
    "                                          iters=70, min_count=3,\n",
    "                                          window_size=25, seed=9102,\n",
    "                                          num_runs=1, word_feat=\"speed_dir_str\")\n",
    "speed_dir_embedding = df_list[0].reset_index(drop=True)\n",
    "total_embedding = pd.merge(total_embedding, speed_dir_embedding,\n",
    "                           on=\"id\", how=\"left\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:15.558661Z",
     "start_time": "2021-04-06T09:41:15.480693Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>embedding_cbow_speed_str_0</th>\n",
       "      <th>embedding_cbow_speed_str_1</th>\n",
       "      <th>embedding_cbow_speed_str_2</th>\n",
       "      <th>embedding_cbow_speed_str_3</th>\n",
       "      <th>embedding_cbow_speed_str_4</th>\n",
       "      <th>embedding_cbow_speed_str_5</th>\n",
       "      <th>embedding_cbow_speed_str_6</th>\n",
       "      <th>embedding_cbow_speed_str_7</th>\n",
       "      <th>embedding_cbow_speed_str_8</th>\n",
       "      <th>embedding_cbow_speed_str_9</th>\n",
       "      <th>...</th>\n",
       "      <th>embedding_cbow_speed_dir_str_2</th>\n",
       "      <th>embedding_cbow_speed_dir_str_3</th>\n",
       "      <th>embedding_cbow_speed_dir_str_4</th>\n",
       "      <th>embedding_cbow_speed_dir_str_5</th>\n",
       "      <th>embedding_cbow_speed_dir_str_6</th>\n",
       "      <th>embedding_cbow_speed_dir_str_7</th>\n",
       "      <th>embedding_cbow_speed_dir_str_8</th>\n",
       "      <th>embedding_cbow_speed_dir_str_9</th>\n",
       "      <th>embedding_cbow_speed_dir_str_10</th>\n",
       "      <th>embedding_cbow_speed_dir_str_11</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-1.751712</td>\n",
       "      <td>0.83344</td>\n",
       "      <td>1.175148</td>\n",
       "      <td>2.350726</td>\n",
       "      <td>0.081093</td>\n",
       "      <td>-1.532153</td>\n",
       "      <td>2.698867</td>\n",
       "      <td>0.873376</td>\n",
       "      <td>-0.839753</td>\n",
       "      <td>-0.537248</td>\n",
       "      <td>...</td>\n",
       "      <td>1.777333</td>\n",
       "      <td>1.009888</td>\n",
       "      <td>0.846912</td>\n",
       "      <td>2.101565</td>\n",
       "      <td>1.721207</td>\n",
       "      <td>2.375947</td>\n",
       "      <td>2.787326</td>\n",
       "      <td>0.845491</td>\n",
       "      <td>-2.064737</td>\n",
       "      <td>1.990452</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-1.751712</td>\n",
       "      <td>0.83344</td>\n",
       "      <td>1.175148</td>\n",
       "      <td>2.350726</td>\n",
       "      <td>0.081093</td>\n",
       "      <td>-1.532153</td>\n",
       "      <td>2.698867</td>\n",
       "      <td>0.873376</td>\n",
       "      <td>-0.839753</td>\n",
       "      <td>-0.537248</td>\n",
       "      <td>...</td>\n",
       "      <td>1.777333</td>\n",
       "      <td>1.009888</td>\n",
       "      <td>0.846912</td>\n",
       "      <td>2.101565</td>\n",
       "      <td>1.721207</td>\n",
       "      <td>2.375947</td>\n",
       "      <td>2.787326</td>\n",
       "      <td>0.845491</td>\n",
       "      <td>-2.064737</td>\n",
       "      <td>1.990452</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-1.751712</td>\n",
       "      <td>0.83344</td>\n",
       "      <td>1.175148</td>\n",
       "      <td>2.350726</td>\n",
       "      <td>0.081093</td>\n",
       "      <td>-1.532153</td>\n",
       "      <td>2.698867</td>\n",
       "      <td>0.873376</td>\n",
       "      <td>-0.839753</td>\n",
       "      <td>-0.537248</td>\n",
       "      <td>...</td>\n",
       "      <td>1.777333</td>\n",
       "      <td>1.009888</td>\n",
       "      <td>0.846912</td>\n",
       "      <td>2.101565</td>\n",
       "      <td>1.721207</td>\n",
       "      <td>2.375947</td>\n",
       "      <td>2.787326</td>\n",
       "      <td>0.845491</td>\n",
       "      <td>-2.064737</td>\n",
       "      <td>1.990452</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-1.751712</td>\n",
       "      <td>0.83344</td>\n",
       "      <td>1.175148</td>\n",
       "      <td>2.350726</td>\n",
       "      <td>0.081093</td>\n",
       "      <td>-1.532153</td>\n",
       "      <td>2.698867</td>\n",
       "      <td>0.873376</td>\n",
       "      <td>-0.839753</td>\n",
       "      <td>-0.537248</td>\n",
       "      <td>...</td>\n",
       "      <td>1.777333</td>\n",
       "      <td>1.009888</td>\n",
       "      <td>0.846912</td>\n",
       "      <td>2.101565</td>\n",
       "      <td>1.721207</td>\n",
       "      <td>2.375947</td>\n",
       "      <td>2.787326</td>\n",
       "      <td>0.845491</td>\n",
       "      <td>-2.064737</td>\n",
       "      <td>1.990452</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-1.751712</td>\n",
       "      <td>0.83344</td>\n",
       "      <td>1.175148</td>\n",
       "      <td>2.350726</td>\n",
       "      <td>0.081093</td>\n",
       "      <td>-1.532153</td>\n",
       "      <td>2.698867</td>\n",
       "      <td>0.873376</td>\n",
       "      <td>-0.839753</td>\n",
       "      <td>-0.537248</td>\n",
       "      <td>...</td>\n",
       "      <td>1.777333</td>\n",
       "      <td>1.009888</td>\n",
       "      <td>0.846912</td>\n",
       "      <td>2.101565</td>\n",
       "      <td>1.721207</td>\n",
       "      <td>2.375947</td>\n",
       "      <td>2.787326</td>\n",
       "      <td>0.845491</td>\n",
       "      <td>-2.064737</td>\n",
       "      <td>1.990452</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 22 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   embedding_cbow_speed_str_0  embedding_cbow_speed_str_1  \\\n",
       "0                   -1.751712                     0.83344   \n",
       "1                   -1.751712                     0.83344   \n",
       "2                   -1.751712                     0.83344   \n",
       "3                   -1.751712                     0.83344   \n",
       "4                   -1.751712                     0.83344   \n",
       "\n",
       "   embedding_cbow_speed_str_2  embedding_cbow_speed_str_3  \\\n",
       "0                    1.175148                    2.350726   \n",
       "1                    1.175148                    2.350726   \n",
       "2                    1.175148                    2.350726   \n",
       "3                    1.175148                    2.350726   \n",
       "4                    1.175148                    2.350726   \n",
       "\n",
       "   embedding_cbow_speed_str_4  embedding_cbow_speed_str_5  \\\n",
       "0                    0.081093                   -1.532153   \n",
       "1                    0.081093                   -1.532153   \n",
       "2                    0.081093                   -1.532153   \n",
       "3                    0.081093                   -1.532153   \n",
       "4                    0.081093                   -1.532153   \n",
       "\n",
       "   embedding_cbow_speed_str_6  embedding_cbow_speed_str_7  \\\n",
       "0                    2.698867                    0.873376   \n",
       "1                    2.698867                    0.873376   \n",
       "2                    2.698867                    0.873376   \n",
       "3                    2.698867                    0.873376   \n",
       "4                    2.698867                    0.873376   \n",
       "\n",
       "   embedding_cbow_speed_str_8  embedding_cbow_speed_str_9  ...  \\\n",
       "0                   -0.839753                   -0.537248  ...   \n",
       "1                   -0.839753                   -0.537248  ...   \n",
       "2                   -0.839753                   -0.537248  ...   \n",
       "3                   -0.839753                   -0.537248  ...   \n",
       "4                   -0.839753                   -0.537248  ...   \n",
       "\n",
       "   embedding_cbow_speed_dir_str_2  embedding_cbow_speed_dir_str_3  \\\n",
       "0                        1.777333                        1.009888   \n",
       "1                        1.777333                        1.009888   \n",
       "2                        1.777333                        1.009888   \n",
       "3                        1.777333                        1.009888   \n",
       "4                        1.777333                        1.009888   \n",
       "\n",
       "   embedding_cbow_speed_dir_str_4  embedding_cbow_speed_dir_str_5  \\\n",
       "0                        0.846912                        2.101565   \n",
       "1                        0.846912                        2.101565   \n",
       "2                        0.846912                        2.101565   \n",
       "3                        0.846912                        2.101565   \n",
       "4                        0.846912                        2.101565   \n",
       "\n",
       "   embedding_cbow_speed_dir_str_6  embedding_cbow_speed_dir_str_7  \\\n",
       "0                        1.721207                        2.375947   \n",
       "1                        1.721207                        2.375947   \n",
       "2                        1.721207                        2.375947   \n",
       "3                        1.721207                        2.375947   \n",
       "4                        1.721207                        2.375947   \n",
       "\n",
       "   embedding_cbow_speed_dir_str_8  embedding_cbow_speed_dir_str_9  \\\n",
       "0                        2.787326                        0.845491   \n",
       "1                        2.787326                        0.845491   \n",
       "2                        2.787326                        0.845491   \n",
       "3                        2.787326                        0.845491   \n",
       "4                        2.787326                        0.845491   \n",
       "\n",
       "   embedding_cbow_speed_dir_str_10  embedding_cbow_speed_dir_str_11  \n",
       "0                        -2.064737                         1.990452  \n",
       "1                        -2.064737                         1.990452  \n",
       "2                        -2.064737                         1.990452  \n",
       "3                        -2.064737                         1.990452  \n",
       "4                        -2.064737                         1.990452  \n",
       "\n",
       "[5 rows x 22 columns]"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pre_cols = df.columns\n",
    "df = df.merge(total_embedding,on='id',how='left')\n",
    "\n",
    "new_cols = [i for i in df.columns if i not in pre_cols]\n",
    "df[new_cols].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting topic distributions from text with NMF"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:16.295670Z",
     "start_time": "2021-04-06T09:41:16.271696Z"
    }
   },
   "outputs": [],
   "source": [
    "class nmf_list(object):\n",
    "    def __init__(self,data,by_name,to_list,nmf_n,top_n):\n",
    "        self.data = data\n",
    "        self.by_name = by_name\n",
    "        self.to_list = to_list\n",
    "        self.nmf_n = nmf_n\n",
    "        self.top_n = top_n\n",
    "\n",
    "    def run(self,tf_n):\n",
    "        df_all = self.data.groupby(self.by_name)[self.to_list].apply(lambda x :'|'.join(x)).reset_index()\n",
    "        self.data =df_all.copy()\n",
    "\n",
    "        print('bulid word_fre')\n",
    "        # Build word frequencies: each word maps to [total count, number of docs containing it]\n",
    "        def word_fre(x):\n",
    "            word_dict = []\n",
    "            x = x.split('|')\n",
    "            docs = []\n",
    "            for doc in x:\n",
    "                doc = doc.split()\n",
    "                docs.append(doc)\n",
    "                word_dict.extend(doc)\n",
    "            word_dict = Counter(word_dict)\n",
    "            new_word_dict = {}\n",
    "            for key,value in word_dict.items():\n",
    "                new_word_dict[key] = [value,0]\n",
    "            del word_dict  \n",
    "            del x\n",
    "            for doc in docs:\n",
    "                doc = Counter(doc)\n",
    "                for word in doc.keys():\n",
    "                    new_word_dict[word][1] += 1\n",
    "            return new_word_dict \n",
    "        self.data['word_fre'] = self.data[self.to_list].apply(word_fre)\n",
    "\n",
    "        print('bulid top_' + str(self.top_n))\n",
    "        # Keep the top_n most frequent words per group (the helpers are named top_100, but the cutoff is self.top_n)\n",
    "        def top_100(word_dict):\n",
    "            return sorted(word_dict.items(),key = lambda x:(x[1][1],x[1][0]),reverse = True)[:self.top_n]\n",
    "        self.data['top_'+str(self.top_n)] = self.data['word_fre'].apply(top_100)\n",
    "        def top_100_word(word_list):\n",
    "            words = []\n",
    "            for i in word_list:\n",
    "                i = list(i)\n",
    "                words.append(i[0])\n",
    "            return words \n",
    "        self.data['top_'+str(self.top_n)+'_word'] = self.data['top_' + str(self.top_n)].apply(top_100_word)\n",
    "        # print('shape of top_'+str(self.top_n)+'_word')\n",
    "        print(self.data.shape)\n",
    "\n",
    "        # Words that appear in the top_n list of more than half of the groups are treated as stop words\n",
    "        word_list = []\n",
    "        for i in self.data['top_'+str(self.top_n)+'_word'].values:\n",
    "            word_list.extend(i)\n",
    "        word_list = Counter(word_list)\n",
    "        word_list = sorted(word_list.items(),key = lambda x:x[1],reverse = True)\n",
    "        user_fre = []\n",
    "        for i in word_list:\n",
    "            i = list(i)\n",
    "            user_fre.append(i[1]/self.data[self.by_name].nunique())\n",
    "        stop_words = []\n",
    "        for i,j in zip(word_list,user_fre):\n",
    "            if j>0.5:\n",
    "                i = list(i)\n",
    "                stop_words.append(i[0])\n",
    "\n",
    "        print('start title_feature')\n",
    "        # Treat the merged tag list as a single sentence for text processing\n",
    "        self.data['title_feature'] = self.data[self.to_list].apply(lambda x: x.split('|'))\n",
    "        self.data['title_feature'] = self.data['title_feature'].apply(lambda line: [w for w in line if w not in stop_words])\n",
    "        self.data['title_feature'] = self.data['title_feature'].apply(lambda x: ' '.join(x))\n",
    "\n",
    "        print('start NMF')\n",
    "        # Vectorize the tokens with TF-IDF\n",
    "        tfidf_vectorizer = TfidfVectorizer(ngram_range=(tf_n,tf_n))\n",
    "        tfidf = tfidf_vectorizer.fit_transform(self.data['title_feature'].values)\n",
    "        # Use NMF to extract the topic distribution of the text\n",
    "        text_nmf = NMF(n_components=self.nmf_n).fit_transform(tfidf)\n",
    "\n",
    "\n",
    "        # Assemble and return the result table\n",
    "        name = [str(tf_n) + self.to_list + '_' +str(x) for x in range(1,self.nmf_n+1)]\n",
    "        tag_list = pd.DataFrame(text_nmf)\n",
    "        print(tag_list.shape)\n",
    "        tag_list.columns = name\n",
    "        tag_list[self.by_name] = self.data[self.by_name]\n",
    "        column_name = [self.by_name] + name\n",
    "        tag_list = tag_list[column_name]\n",
    "        return tag_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:17.109358Z",
     "start_time": "2021-04-06T09:41:16.763209Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "********* 1 *******\n",
      "bulid word_fre\n",
      "bulid top_2\n",
      "(6, 5)\n",
      "start title_feature\n",
      "start NMF\n",
      "(6, 8)\n",
      "bulid word_fre\n",
      "bulid top_2\n",
      "(6, 5)\n",
      "start title_feature\n",
      "start NMF\n",
      "(6, 8)\n",
      "bulid word_fre\n",
      "bulid top_2\n",
      "(6, 5)\n",
      "start title_feature\n",
      "start NMF\n",
      "(6, 8)\n",
      "********* 2 *******\n",
      "bulid word_fre\n",
      "bulid top_2\n",
      "(6, 5)\n",
      "start title_feature\n",
      "start NMF\n",
      "(6, 8)\n",
      "bulid word_fre\n",
      "bulid top_2\n",
      "(6, 5)\n",
      "start title_feature\n",
      "start NMF\n",
      "(6, 8)\n",
      "bulid word_fre\n",
      "bulid top_2\n",
      "(6, 5)\n",
      "start title_feature\n",
      "start NMF\n",
      "(6, 8)\n",
      "********* 3 *******\n",
      "bulid word_fre\n",
      "bulid top_2\n",
      "(6, 5)\n",
      "start title_feature\n",
      "start NMF\n",
      "(6, 8)\n",
      "bulid word_fre\n",
      "bulid top_2\n",
      "(6, 5)\n",
      "start title_feature\n",
      "start NMF\n",
      "(6, 8)\n",
      "bulid word_fre\n",
      "bulid top_2\n",
      "(6, 5)\n",
      "start title_feature\n",
      "start NMF\n",
      "(6, 8)\n"
     ]
    }
   ],
   "source": [
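    "data = df.copy()\n",
    "data.rename(columns={'v':'speed','id':'ship'},inplace=True)\n",
    "# j is the n-gram size passed to TfidfVectorizer (1, 2, 3)\n",
    "for j in range(1,4):\n",
    "    print('********* {} *******'.format(j))\n",
    "    for i in ['speed','x','y']:\n",
    "        data[i + '_str'] = data[i].astype(str)\n",
    "        # 8 NMF components per feature; top_n=2 words per ship feed the stop-word filter\n",
    "        nmf = nmf_list(data,'ship',i + '_str',8,2)\n",
    "        nmf_a = nmf.run(j)\n",
    "        nmf_a.rename(columns={'ship':'id'},inplace=True)\n",
    "        # data_label comes from an earlier cell; the NMF columns are merged onto it by id\n",
    "        data_label = data_label.merge(nmf_a,on = 'id',how = 'left')"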
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-04-06T09:41:17.543827Z",
     "start_time": "2021-04-06T09:41:17.473051Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>1speed_str_1</th>\n",
       "      <th>1speed_str_2</th>\n",
       "      <th>1speed_str_3</th>\n",
       "      <th>1speed_str_4</th>\n",
       "      <th>1speed_str_5</th>\n",
       "      <th>1speed_str_6</th>\n",
       "      <th>1speed_str_7</th>\n",
       "      <th>1speed_str_8</th>\n",
       "      <th>1x_str_1</th>\n",
       "      <th>1x_str_2</th>\n",
       "      <th>...</th>\n",
       "      <th>3x_str_7</th>\n",
       "      <th>3x_str_8</th>\n",
       "      <th>3y_str_1</th>\n",
       "      <th>3y_str_2</th>\n",
       "      <th>3y_str_3</th>\n",
       "      <th>3y_str_4</th>\n",
       "      <th>3y_str_5</th>\n",
       "      <th>3y_str_6</th>\n",
       "      <th>3y_str_7</th>\n",
       "      <th>3y_str_8</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.014368</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.009987</td>\n",
       "      <td>0.313981</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.104036</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.12743</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.091</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.014368</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.009987</td>\n",
       "      <td>0.313981</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.104036</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.12743</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.091</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.014368</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.009987</td>\n",
       "      <td>0.313981</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.104036</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.12743</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.091</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.014368</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.009987</td>\n",
       "      <td>0.313981</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.104036</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.12743</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.091</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.014368</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.009987</td>\n",
       "      <td>0.313981</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.104036</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.12743</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.091</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 72 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   1speed_str_1  1speed_str_2  1speed_str_3  1speed_str_4  1speed_str_5  \\\n",
       "0           0.0           0.0      0.014368           0.0      0.009987   \n",
       "1           0.0           0.0      0.014368           0.0      0.009987   \n",
       "2           0.0           0.0      0.014368           0.0      0.009987   \n",
       "3           0.0           0.0      0.014368           0.0      0.009987   \n",
       "4           0.0           0.0      0.014368           0.0      0.009987   \n",
       "\n",
       "   1speed_str_6  1speed_str_7  1speed_str_8  1x_str_1  1x_str_2  ...  \\\n",
       "0      0.313981           0.0      0.104036       0.0       0.0  ...   \n",
       "1      0.313981           0.0      0.104036       0.0       0.0  ...   \n",
       "2      0.313981           0.0      0.104036       0.0       0.0  ...   \n",
       "3      0.313981           0.0      0.104036       0.0       0.0  ...   \n",
       "4      0.313981           0.0      0.104036       0.0       0.0  ...   \n",
       "\n",
       "   3x_str_7  3x_str_8  3y_str_1  3y_str_2  3y_str_3  3y_str_4  3y_str_5  \\\n",
       "0       0.0   0.12743       0.0       0.0       0.0     0.091       0.0   \n",
       "1       0.0   0.12743       0.0       0.0       0.0     0.091       0.0   \n",
       "2       0.0   0.12743       0.0       0.0       0.0     0.091       0.0   \n",
       "3       0.0   0.12743       0.0       0.0       0.0     0.091       0.0   \n",
       "4       0.0   0.12743       0.0       0.0       0.0     0.091       0.0   \n",
       "\n",
       "   3y_str_6  3y_str_7  3y_str_8  \n",
       "0       0.0       0.0       0.0  \n",
       "1       0.0       0.0       0.0  \n",
       "2       0.0       0.0       0.0  \n",
       "3       0.0       0.0       0.0  \n",
       "4       0.0       0.0       0.0  \n",
       "\n",
       "[5 rows x 72 columns]"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_cols = [i for i in data_label.columns if i not in df.columns]\n",
    "df = df.merge(data_label[new_cols+['id']],on='id',how='left')\n",
    "\n",
    "df[new_cols].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 总结与思考"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 赛题特征工程：该如何构建有效果的赛题特征工程\n",
    "    \n",
    "        参考：通过数据EDA、查阅对应赛题的参考文献，寻找并构建有实际意义的业务特征\n",
    "\n",
    "\n",
    "- 分箱特征：几乎所有topline代码中均有分箱特征的构造，为何分箱特征如此重要且有效。在什么情况下使用分箱特征的效果好？（为什么本赛题需要分箱特征）\n",
    "        \n",
    "        参考：分箱的原理\n",
    "\n",
    "- DataFrame特征：针对pandas DataFrame的内置方法的使用，可以构造出大量的统计特征。建议：自行整理一份针对表格数据的统计特征构造函数\n",
    "        \n",
    "        参考：DataWhale的joyful pandas\n",
    "\n",
    "\n",
    "- Embedding特征：上分秘籍，将序列转换成NLP文本中的一句话或一篇文章进行特征向量化为何效果如此之好。如何针对给定数据，调整参数构造较好的词向量？\n",
    "        \n",
    "        参考：Word2vec的学习"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 附录\n",
    "\n",
    "## 学习来源\n",
    "1 团队名称：Pursuing the Past Youth\n",
    "链接：\n",
    "https://github.com/juzstu/TianChi_HaiYang\n",
    "\n",
    "2 团队名称：liu123的航空母舰队\n",
    "链接：\n",
    "https://github.com/MichaelYin1994/tianchi-trajectory-data-mining\n",
    "\n",
    "3 团队名称：天才海神号\n",
    "链接：\n",
    "https://github.com/fengdu78/tianchi_haiyang?spm=5176.12282029.0.0.5b97301792pLch\n",
    "\n",
    "4 团队名称：大白\n",
    "链接：\n",
    "https://github.com/Ai-Light/2020-zhihuihaiyang\n",
    "\n",
    "5 团队名称：抗毒救灾\n",
    "链接：\n",
    "https://github.com/wudejian789/2020DCIC_A_Rank7_B_Rank12\n",
    "\n",
    "6 团队名称：蜗牛坐车里团队\n",
    "链接：\n",
    "https://tianchi.aliyun.com/notebook-ai/detail?postId=114808\n",
    "\n",
    "7 团队名称：用欧气驱散疫情\n",
    "链接：\n",
    "https://github.com/tudoulei/2020-Digital-China-Innovation-Competition\n",
    "\n",
    "## 数据\n",
    "所用数据是 hy_round1_train_20200102（初赛数据）\n",
    "\n",
    "## 运行过程\n",
    "针对各团队的整理的详细运行代码见 ipynb/*.ipynb\n",
    "数字序号与上面相同\n",
    "\n",
    "## 运行结果\n",
    "文件输出见 result/*.csv\n",
    "\n",
    "## 部分解释\n",
    "\n",
    "- 【天池智慧海洋建设】Topline源码——特征工程学习（大白）：\n",
    "https://blog.csdn.net/qq_44574333/article/details/115188086\n",
    "s\n",
    "- 【天池智慧海洋建设】Topline源码——特征工程学习（Pursuing the Past Youth）：\n",
    "https://blog.csdn.net/qq_44574333/article/details/112547081\n",
    "\n",
    "- 【天池智慧海洋建设】Topline源码——特征工程学习（天才海神号）：\n",
    "https://blog.csdn.net/qq_44574333/article/details/115185634\n",
    "\n",
    "- 【天池智慧海洋建设】Topline源码——特征工程学习（liu123的航空母舰队）：\n",
    "https://blog.csdn.net/qq_44574333/article/details/115091764\n",
    "\n",
    "## 推荐的学习资料\n",
    "实战类：知名比赛的topline代码，如kaggle、天池等平台的开源代码\n",
    "\n",
    "书籍类： \n",
    "    \n",
    "    +《阿里云天池大赛赛题解析》\n",
    "       \n",
    "       【笔者也有博客笔记学习(https://blog.csdn.net/qq_44574333/article/details/109611764)】\n",
    "       \n",
    "    +《美团机器学习实战》\n",
    "   \n",
    "\n",
    "教程类：\n",
    "\n",
    "    + Joyful Pandas 强烈推荐！基础且高效\n",
    "    http://joyfulpandas.datawhale.club/"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "580px",
    "left": "53px",
    "top": "143px",
    "width": "307.2px"
   },
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
